Introduction


This book represents a synergistic combination of stochastic processes, estimation, 
optimization and the analysis of recursive stochastic algorithms. Readers with vary- 
ing interests and mathematical backgrounds may easily approach our book. This 
book deals with powerful and convenient approaches to a great variety of problems. 
It represents a contribution to the field of applied probability and statistics. We
have written this book from an engineering point of view. Our emphasis here is on 
the design of techniques that enable the reader to use the theory. 


The first chapter is devoted to stochastic processes. We present the foundations 
of probability theory: the conditional mathematical expectation, the notion of 
convergence for stochastic processes, different examples of stochastic processes 
related to different engineering areas, renewal processes, finite Markov chains and 
martingales, which have extensive applications in stochastic problems and arise nat- 
urally whenever one needs to consider mathematical expectations with respect to 
increasing information patterns. Many examples are given in order to illustrate, and 
to help readers deeply understand, the mathematical tools contained in this chapter. 
We have tried to make the contents easily accessible also for readers not familiar 
with mathematical works, and avoided using very complex mathematical tools. 
For example, many people are familiar with the classical concept of integrals in 
the Riemann sense but they know nothing about integration in the Lebesgue sense. 
We have avoided the use of measure theory. We have also introduced examples 
in order to explain physically the concepts that are useful in stochastic processes. 
Chapter 1 also exhibits another characteristic, as it contains renewal processes and 
finite Markov chains, which play an important role in the modelling and analysis 
of many engineering problems (control, reliability, maintenance, etc.). 


The second chapter contains the main probability distributions, their properties,
and the algorithms used for probability density estimation, which play an important
role in different areas (data analysis, modelling, reliability, insurance, etc.).
They naturally arise as functions of random variables in many real (engineering,
economic, biological, medical, etc.) problems. In fact, real data (measurements)
exhibit random characteristics. We explain how real random phenomena are dis- 
tributed according to such-and-such a probability distribution. We can calculate 
the probability that a random variable takes on values in any set of the real line, 
which is of interest. We give the existing relations between different distribution 
laws, and present a criterion for their classification. We show how to calculate the 
different probability characteristics (mean, variance, etc.). 


The estimation of probability distributions plays an important role in practice. Many 
techniques exist for this purpose. We first present the method of moments, followed 
by the kernel approach. The method of moments seems to be natural when a priori 
information is available and leads to the resolution of algebraic equations. We 
focus our attention on three parametric approaches, namely, expectation maximization
(EM), which is an iterative procedure for approximating maximum-likelihood
estimates for mixture density problems, and its extension based on the total kurtosis 
(measure of heaviness of the tails of a distribution); the method based on neural 
networks; and, finally, an approach based on stochastic approximation techniques, 
which can handle problems related to estimation under constraints (physical, math- 
ematical, etc.) of some unknown parameters in the sense of some criterion. The 
emphasis here will be on numerical considerations pertinent to probability density
estimation. It is shown that a Gaussian mixture model is general and can approximate
any continuous function having a finite number of discontinuities. This result is
not surprising because many physical processes follow the Gaussian law. In other 
words, the Gaussian law can be considered as the basic element of a "Lego" toy in 
the framework of a Gaussian mixture. We will treat also some other approximations 
of a given probability density on the basis of polynomial (Hermite, etc.) expansions.


The third chapter is dedicated to some optimization algorithms. Optimization tech- 
niques have been gaining greater acceptance in many industrial applications. This 
fact was motivated by the increased interest in improved economy in and better 
utilization of existing material resources. Euler says: "Nothing happens in the
universe that does not have a sense of either certain maximum or minimum." The
main algorithms consist of an intelligent assemblage (arrangement) of a set of
"simple" operations or tasks, which are repeated several times. Some of them are
a mimic representation of real phenomena (genetic algorithms, simulated annealing,
optimization algorithms based on learning systems, etc.). From a mathematical
point of view, simulated annealing is a non-homogeneous Markovian optimization
algorithm. We present also algorithms based on stochastic approximation tech- 
niques. Stochastic approximation techniques provide prototype models (see for 
example the estimation method presented in the previous chapter) and methodo- 
logy for optimization and adjustment in many areas (control, economy, reliability, 
etc.). All the optimization algorithms described in this chapter belong to the class 
of random search techniques, and as a consequence they are not doomed by local 
optima. We consider both unconstrained and constrained optimization problems. 


The last chapter presents a set of recursive stochastic algorithms and their ana- 
lysis. Typically, a sequence of estimates is obtained by means of some recursive 
statistical procedure. The nth estimate is some function of the (n - 1)th estimate
and of some new observational data, and the aim is to study convergence
and other qualitative properties (convergence rate) of the algorithm. We have 
tried to bring out a methodology for the derivation of the asymptotic properties 
of recursive stochastic algorithms. In this chapter we show how to use directly 
results dealing with stochastic approximation techniques, well-known inequalit- 
ies, the Lyapunov approach and the martingale theory. In general, the problem of 
determining whether a given stochastic algorithm will converge or not can require 
a great deal of ingenuity and resourcefulness. We have tried to make this easier by 
presenting a methodology, the complete analysis of two recursive algorithms, and 
many direct applications of the standard inequalities (Cauchy, Jensen, Minkowski, 
Hadamard, inequalities based on vectors, matrices, determinants, etc.), well-known 
lemmas (Borel-Cantelli, Fatou, Kronecker, etc.) and theorems (Robbins-Monro, 
Robbins-Siegmund, etc.). This chapter is divided into four parts. In the first part
we present some parts of proofs of theorems in order to illustrate how to use dif- 
ferent inequalities, lemmas and theorems. In the second part we present in detail 
the analysis of a recursive stochastic algorithm (on the basis of a study carried out 
by the authors and P. Del Moral). The third part of this chapter is devoted to the 
analysis of another algorithm developed by A. S. Poznyak and the authors. The 
analysis of this algorithm is based on the use of the Lyapunov approach, martingale 
theory and different inequalities. Based on examples, we show why some assump- 
tions are made (usually some assumptions are introduced only in order to prove 
the convergence of a given algorithm, and in practice these assumptions can be 
relaxed). Finally, the last part of this chapter is dedicated to the analysis of a simple 
recursive scheme and a method for deriving the convergence rate. 


This book contains two appendices. The main inequalities, lemmas and theorems 
are collected in the first. We give the proofs of the more important tools. These 
proofs can also help the reader to understand the proofs of some results found in the 
literature or to develop the analysis of their own algorithms. The second appendix 
contains a set of Matlab programs ready for use. 


Each chapter can be read in one of two ways: either in detail to gain a full appre- 
ciation of the results, or selectively to get the flavor of the results and to establish 
notation. Our book exhibits another characteristic. In fact, it can be read in different 
ways: linearly, i.e., chapter by chapter, or nonlinearly, depending on the primary 
interest of the reader. Finally, our objective in writing this book was twofold. First, 
it was to present current and important tools in stochastic processes, estimation, 
optimization, and recursive algorithms analysis in a form accessible to engineers, 
yet without completely sacrificing mathematical rigor. Secondly, it was to gather 
together, in a unified presentation, many of the results in probability and statistics. 
We emphasize the applications of the tools contained in this book to real problems. 


Any book on advanced methods is predetermined to be incomplete. We have selec- 
ted a set of methods and approaches based on our own preferences, reflected by 
our experience — and, undoubtedly, lack of experience — with many of the modern 
approaches. In spite of all this, we believe we have put together a solid package of
material on relevant methods on stochastic processes, estimation, optimization, and 
analysis of recursive stochastic algorithms. We hope that this book will be valuable 
to students in automatic control, mechanical and electrical engineering, as well as 
all engineers dealing with stochastic processes. 


We would like to thank our friend and colleague Professor A. S. Poznyak for 
providing valuable comments on the manuscript. This book is partially based on 
lectures given at the University of Oulu, Finland. 


The work of E. Ikonen was supported by the Academy of Finland, project no 48545. 
Professor Kaddour Najim 


Docent Enso Ikonen 
Professor Daoud Ait-Kadi 
Chapter 1


Stochastic Processes 


1.1. Introduction 


In this chapter an introduction to the basic concepts and properties of stochastic 
processes is given. Stochastic processes play an important role in various fields 
(chemistry, process engineering, reliability, maintenance, biology, medicine, 
economy, insurance, etc.). Ermoliev and Wets [20] say: 


Systems involving interactions between man, nature and technology are subject to disturb- 
ance which may be unlike anything which has been experienced in the past. In particular, 
the technological revolution increases uncertainty as each new stage perturbs existing 
structures, limitations and constraints. 


For example, the combination of a hazard and operability study and a fault tree 
analysis gives the probability of an accident occurring in a chemical system. Other 
examples can be drawn from chemical engineering. Due to random occurrences of 
molecular collision, complexations and dissociations, chemical processes exhibit a 
stochastic behavior [15]. It has been observed that in a fluidized bed the conversion
rate was higher than in a rotating kiln or in a steady layer. The available
models could not explain this phenomenon. To explain it, Blickle et al. [9]
assumed the virtual rate constant to be a stochastic variable distributed
according to the uniform law. Stochastic models are
used to represent the randomness and to provide estimates of the media parameters 
that determine fluid flow, pollutant transport, and heat-mass transfer in natural 
porous media. Indeed, the pore structure in these media is very complex [41]. 


This chapter presents the foundations of probability, discrete stochastic processes, 
and gives several examples of stochastic processes such as random walk, Markov 
chains, renewal processes and martingales. We also present the properties of mar- 
tingale processes and a set of convergence theorems. The next section presents the 
main tools related to probability theory. 


1.2. Foundations of Probability 


Let us recall some results related to the algebra of sets. The main operations on sets
are "and", "or" and "not", which are denoted by $\cap$, $\cup$ (or $+$), and an overbar ($\bar{A}$) or a superscript $c$ ($A^c$), respectively.
These operations produce new sets. The Venn diagram of these operations is drawn 
in Figure 1.1. 


Figure 1.1 Operations on sets 


The following table summarizes the main properties of the operations on sets. 


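\[
\begin{array}{l}
A \subset B \iff B^c \subset A^c \\
AB = BA \ \text{ and } \ A \cup B = B \cup A \\
(AB)C = A(BC) \ \text{ and } \ (A \cup B) \cup C = A \cup (B \cup C) \\
A(B \cup C) = AB \cup AC \ \text{ and } \ A \cup BC = (A \cup B)(A \cup C) \\
(AB)^c = A^c \cup B^c, \quad (A \cup B)^c = A^c B^c \ \text{ and } \ A \cup B = A + A^c B \\
\left(\bigcap_{i=1}^{n} A_i\right)^c = \bigcup_{i=1}^{n} A_i^c, \quad \left(\bigcup_{i=1}^{n} A_i\right)^c = \bigcap_{i=1}^{n} A_i^c \\
\bigcup_{i=1}^{n} A_i = A_1 \cup (A_1^c \cap A_2) \cup (A_1^c \cap A_2^c \cap A_3) \cup \cdots \cup (A_1^c \cap A_2^c \cap \cdots \cap A_{n-1}^c \cap A_n)
\end{array}
\]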


where $AB$ denotes $A \cap B$. The inclusion operation $\subset$ constitutes an order relation.
Indeed,


$A \subset A$ (reflexivity)
$A \subset B$ and $B \subset A \Rightarrow A = B$ (antisymmetry)
$A \subset B$ and $B \subset C \Rightarrow A \subset C$ (transitivity).


The expressions related to


\[ \left(\bigcap_{i=1}^{n} A_i\right)^c \quad \text{and} \quad \left(\bigcup_{i=1}^{n} A_i\right)^c \]


correspond to what is called the De Morgan theorem.


Notice that we can also define the difference between two sets. If A and B represent 
two subsets, then the set defined by 


\[ A - B = \{x : x \in A,\ x \notin B\} \]

Figure 1.2 Symmetric difference between two sets


constitutes the difference between A and B. This difference is sometimes denoted 
by $A \setminus B$. The symmetric difference between two sets $A$ and $B$ is denoted by $A \triangle B$,
and is given by


\[ A \triangle B = (A - B) \cup (B - A) = (A \cap B^c) \cup (A^c \cap B). \]


Figure 1.2 clarifies these definitions. The shaded area corresponds to $A \triangle B$.


Finally, observe that a set $A$ can be decomposed into a collection of sets $A_i$ called
atoms such that the $A_i$ are not empty, pairwise disjoint, and their sum is equal to
$A$. As an example, consider the following set: $A = \{a, b, c\}$. Its decomposition
leads to:


$A_1 = \{a, b, c\}$

$A_1 = \{a, b\}$, $A_2 = \{c\}$

$A_1 = \{a, c\}$, $A_2 = \{b\}$

$A_1 = \{b, c\}$, $A_2 = \{a\}$

$A_1 = \{a\}$, $A_2 = \{b\}$, $A_3 = \{c\}$.
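

The five decompositions listed above can be reproduced mechanically. The following is a minimal sketch, not taken from the book: the function name `partitions` and the list-of-lists representation are assumptions made for illustration. It enumerates every way of splitting a finite set into non-empty, pairwise disjoint blocks.

```python
from typing import Iterator, List


def partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Yield every decomposition of `items` into non-empty, pairwise disjoint blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn ...
        for k in range(len(smaller)):
            yield smaller[:k] + [[first] + smaller[k]] + smaller[k + 1:]
        # ... or open a new block that contains only `first`.
        yield [[first]] + smaller


decompositions = list(partitions(["a", "b", "c"]))
for blocks in decompositions:
    print(blocks)
```

For a set of three elements the enumeration returns five decompositions, in agreement with the list given in the text.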


1.2.1. Intuitive Definition of Probability 


The previous properties play an important role in probability theory and permit one 
to easily understand many theoretical results. Let us associate each point $\omega$ from
a given set $A$ (i.e., $\omega \in A$) with an "elementary event," and denote by $A_0$ some
collection of such elementary events, that is, $A_0 \subset A$. Let us call $A_0$ an "event" (see
Figure 1.3).


Then we may introduce the probability $\Pr(A_0)$ of the event $A_0$ as a measure
("area" or "volume") of this set. The exact definition of "probability" will be
given in Subsection 1.2.2.
Figure 1.3 Illustration of events


In order to illustrate the usefulness of these definitions, in view of the Venn diagram 
we may conclude that: 


\[ \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) \]
or


\[
\begin{array}{l}
\Pr(A \cup B) \le \Pr(A) + \Pr(B) \\
\Pr(A^c) = 1 - \Pr(A) \\
\Pr(B - A) = \Pr(B) - \Pr(A) \quad \text{if } A \subset B \\
\Pr(A \cap B^c) = 0 \quad \text{if } A \subset B.
\end{array}
\]


If $A \cap B = \emptyset$, then
\[ \Pr(A \cup B) = \Pr(A) + \Pr(B). \]
We say that:
1. $A$ implies $B$ if
\[ \Pr(A \cap B^c) = 0; \]
2. $A = B$ if $A$ implies $B$, and $B$ implies $A$, i.e., [63],
\[ \Pr(A \triangle B) = 0. \]
In the next example we show how the Boole inequality relates the probability of
the union of sets to the sum of the probabilities of the sets composing the union
considered.


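Example 1 (Boole inequality) (see [18]) Let $A_1, A_2, \ldots$ be independent events.
Let us show that


\[ \Pr\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} \Pr(A_i). \]

Solution Let us consider the following sets


\[ B_1 = A_1 \quad \text{and} \quad B_i = A_1^c A_2^c \cdots A_{i-1}^c A_i \quad \text{for } i \ge 2. \]


Then
\[ B_i \cap B_j = \emptyset \quad \text{if } i \ne j \]
and
\[ \bigcup_{i=1}^{\infty} B_i = \bigcup_{i=1}^{\infty} A_i, \]
so


\[ \Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \Pr\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} \Pr(B_i). \]


Taking into account that
\[ B_i \subset A_i, \quad i \ge 1, \]
we derive
\[ \Pr(B_i) \le \Pr(A_i), \]


which yields the desired result.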
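

The construction used in this solution can be checked on a small finite example. The sketch below is only an illustration under stated assumptions: the sample space of 12 equally likely points and the three events are hypothetical choices, and exact fractions are used to avoid rounding. It builds the disjoint sets $B_i$ from the $A_i$ and compares the three quantities that appear in the proof. The events overlap and are not chosen to be independent, since the argument itself only uses the disjointness of the $B_i$ and the inclusion $B_i \subset A_i$.

```python
from fractions import Fraction

# Hypothetical sample space: 12 equally likely points.
omega = set(range(1, 13))


def pr(event):
    """Probability of an event under the uniform law on omega."""
    return Fraction(len(event), len(omega))


# Three overlapping events (illustrative choice).
A = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 6, 7}]

# B_1 = A_1 and B_i = A_i minus everything covered by A_1, ..., A_{i-1}.
B, covered = [], set()
for a in A:
    B.append(a - covered)
    covered |= a

prob_union = pr(set().union(*A))
sum_B = sum(pr(b) for b in B)   # equals Pr(union) because the B_i are disjoint
sum_A = sum(pr(a) for a in A)   # upper bound given by the Boole inequality

print(prob_union, sum_B, sum_A)
```

Here the probability of the union and the sum over the $B_i$ coincide, while the sum over the $A_i$ is larger because overlapping points are counted more than once.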


The next example concerns the convolution, which plays an important role in the
framework of reliability, and in renewal processes in particular.


Example 2 (Convolution) (see [21, 23]) Let $X$ and $Y$ be nonnegative independent
random variables with probability distributions
\[ p_j = \Pr(X = j) \quad \text{and} \quad q_j = \Pr(Y = j). \]
Next, let
\[ S = X + Y \]
and consider the event $\{S = k\}$. Let us calculate the probability


\[ \Pr(S = k). \]
Solution The event $S = k$ is the union of the following mutually exclusive events:
\[ (X = 0, Y = k),\ (X = 1, Y = k - 1),\ \ldots,\ (X = k, Y = 0) \]
and, consequently, the distribution $\Pr(S = k)$ is given by
\[ \Pr(S = k) = p_0 q_k + p_1 q_{k-1} + p_2 q_{k-2} + \cdots + p_k q_0. \]


This corresponds to the convolution of the sequences $\{p_k\}$ and $\{q_k\}$, denoted by $*$:


\[ \{p_k\} * \{q_k\}. \]
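

As an illustration of the convolution formula, the following sketch computes the distribution of the sum of two independent fair dice. The function name `convolve` and the choice of dice are assumptions made for the example; distributions are stored as lists indexed by the value $j = 0, 1, 2, \ldots$

```python
from fractions import Fraction


def convolve(p, q):
    """Return r with r[k] = sum_j p[j] * q[k - j], the distribution of S = X + Y."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for j, pj in enumerate(p):
        for i, qi in enumerate(q):
            r[j + i] += pj * qi
    return r


# Fair die as a distribution on {0, 1, ..., 6}; the value 0 has probability 0.
die = [Fraction(0)] + [Fraction(1, 6)] * 6
two_dice = convolve(die, die)

for k, prob in enumerate(two_dice):
    print(k, prob)
```

The double loop visits exactly the mutually exclusive pairs $(X = j, Y = k - j)$ listed in the solution, so each entry of the result is the sum $p_0 q_k + \cdots + p_k q_0$.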


These examples are related to the axioms governing a probability function. 
The main axioms have been stated by Kolmogorov [38]. In what follows, we shall 
present an example related to some calculations carried out in the framework of 
reliability. 


Let the time to failure $T$ be continuously distributed with probability
\[ F(t) = \Pr(T \le t). \tag{1.1} \]
$F(t)$ thus denotes the probability that the unit fails within the time interval $(0, t]$.


There is a straightforward relationship between the time to failure distribution and
the equipment's reliability function. The reliability function $R(x)$ of the equipment
is given by


\[ R(x) = \Pr(T > x) = 1 - F(x), \quad x > 0. \]


We shall calculate the reliability for the series structure of N components (see
Figure 1.4). 


The system depicted in Figure 1.4 is functioning if all the N components are 
functioning. For the calculation of reliability of a given system, we shall present 
a method based on the Venn diagram. 


For the clarity of presentation, let us consider only two systems in series (N = 2). 
Let us denote by Pr(Struc), Pr(Sys1) and Pr(Sys2) the probabilities that the whole 
system is in operation, the system number 1 is in operation, and the system number 2 
is in operation, respectively. Next, consider the Venn diagram depicted in Figure 1.5. 
The probability that the system 1 is in failure mode is given by 


\[ \Pr(\overline{\mathrm{Sys1}}) = 1 - \Pr(\mathrm{Sys1}), \quad \overline{\mathrm{Sys1}} = \mathrm{Sys1}^c. \]


Figure 1.4 Series structure of N components 


Figure 1.5 Venn diagram in reliability 


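Observe that
\[ \mathrm{Struc} = \mathrm{Struc} \cap \mathrm{Sys1} + \mathrm{Struc} \cap \overline{\mathrm{Sys1}}, \]
then
\[ \Pr(\mathrm{Struc}) = \Pr(\mathrm{Struc} \cap \mathrm{Sys1}) + \Pr(\mathrm{Struc} \cap \overline{\mathrm{Sys1}}). \]
Taking into account that


\[ \Pr(\mathrm{Struc} \cap \mathrm{Sys1}) = \Pr(\mathrm{Struc} \mid \mathrm{Sys1}) \Pr(\mathrm{Sys1}) \]


and
\[ \Pr(\mathrm{Struc} \cap \overline{\mathrm{Sys1}}) = \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys1}}) \Pr(\overline{\mathrm{Sys1}}) = \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys1}}) (1 - \Pr(\mathrm{Sys1})), \]
it follows that


\[ \Pr(\mathrm{Struc}) = \Pr(\mathrm{Struc} \mid \mathrm{Sys1}) \Pr(\mathrm{Sys1}) + \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys1}}) (1 - \Pr(\mathrm{Sys1})). \tag{1.2} \]


For two systems in series, we have


\[ \Pr(\mathrm{Struc} \mid \mathrm{Sys1}) = \Pr(\mathrm{Sys2}), \quad \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys1}}) = 0. \]


We thus obtain
\[ \Pr(\mathrm{Struc}) = \Pr(\mathrm{Sys1}) \Pr(\mathrm{Sys2}) \]
and in general


\[ \Pr(\mathrm{Struc}) = \prod_{i=1}^{N} \Pr(\mathrm{Sys}\,i). \]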




Figure 1.6 Parallel structure of N components 
For a parallel structure of N components (see Figure 1.6), the system is functioning 
if at least one component is functioning. 


For the developments in what follows, we shall consider only the case when $N = 2$.
Using (1.2) for


\[ \Pr(\mathrm{Struc} \mid \mathrm{Sys1}) = 1, \quad \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys1}}) = \Pr(\mathrm{Sys2}), \]


we get


\[ \Pr(\mathrm{Struc}) = \Pr(\mathrm{Sys1}) + \Pr(\mathrm{Sys2})(1 - \Pr(\mathrm{Sys1})) = \Pr(\mathrm{Sys1}) + \Pr(\mathrm{Sys2}) - \Pr(\mathrm{Sys1}) \Pr(\mathrm{Sys2}). \]


For a parallel structure with $N$ independent components, the reliability $R(t)$ is
given by


\[ R(t) = 1 - \prod_{i=1}^{N} (1 - R_i(t)), \]


where $R_i(t)$ represents the reliability for the ith component.
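

The series and parallel formulas translate directly into code. The sketch below is a minimal illustration that assumes independent components; the function names `series` and `parallel` and the numerical values are hypothetical.

```python
from math import prod


def series(reliabilities):
    """System works only if every component works (independent components assumed)."""
    return prod(reliabilities)


def parallel(reliabilities):
    """System works if at least one component works (independent components assumed)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


r = [0.9, 0.8]  # illustrative component reliabilities at some fixed time t
r_series = series(r)
r_parallel = parallel(r)
print(r_series, r_parallel)
```

For two components the parallel value agrees with $\Pr(\mathrm{Sys1}) + \Pr(\mathrm{Sys2}) - \Pr(\mathrm{Sys1})\Pr(\mathrm{Sys2})$, and the series value is never larger than the reliability of the weakest component.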
What we have just presented is a method based on the Venn diagram for the calculation
of the reliability of structures consisting of systems in series and systems
in parallel. This method can be extended to any structure if we are able to calculate


\[ \Pr(\mathrm{Struc} \mid \mathrm{Sys}\,i) \quad \text{and} \quad \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys}\,i}). \]


In order to illustrate this fact, let us consider the bridge structure drawn in Figure 1.7. 
If Sys3 is functioning, then the global system is equivalent to the system drawn
in Figure 1.8. If Sys3 is in failure mode, then the overall system is equivalent to the 
system depicted in Figure 1.9. 


Figure 1.7 Complex structure 


Figure 1.8 Equivalent structure when Sys3 is functioning 
Figure 1.9 Equivalent structure when Sys3 is in failure mode


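Writing (1.2) in terms of $\Pr(\mathrm{Sys3})$ and $\Pr(\overline{\mathrm{Sys3}})$, we derive


\[ \Pr(\mathrm{Struc}) = \Pr(\mathrm{Struc} \mid \mathrm{Sys3}) \Pr(\mathrm{Sys3}) + \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys3}}) (1 - \Pr(\mathrm{Sys3})). \]
We have expressed the fact that if Sys3 is functioning, the global system is also
functioning if Sys1 or Sys2, and Sys4 or Sys5, are not in failure mode (see
Figure 1.8):
\[ \Pr(\mathrm{Struc} \mid \mathrm{Sys3}) = [\Pr(\mathrm{Sys1}) + \Pr(\mathrm{Sys2}) - \Pr(\mathrm{Sys1}) \Pr(\mathrm{Sys2})] \times [\Pr(\mathrm{Sys4}) + \Pr(\mathrm{Sys5}) - \Pr(\mathrm{Sys4}) \Pr(\mathrm{Sys5})]. \]


When Sys3 is in failure mode, the global system is functioning if Sys1 and Sys4
or Sys2 and Sys5 are functioning:


\[ \Pr(\mathrm{Struc} \mid \overline{\mathrm{Sys3}}) = \Pr(\mathrm{Sys1}) \Pr(\mathrm{Sys4}) + \Pr(\mathrm{Sys2}) \Pr(\mathrm{Sys5}) - \Pr(\mathrm{Sys1}) \Pr(\mathrm{Sys4}) \Pr(\mathrm{Sys2}) \Pr(\mathrm{Sys5}). \]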


These relations only represent the fact that the global system works if and only if 
there exists a path from the input to the output. 
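

The path interpretation gives a simple way to cross-check the conditioning formula. The sketch below is an illustration under two assumptions that the text only implies: the components fail independently, and the bridge has the four input-to-output paths Sys1-Sys4, Sys2-Sys5, Sys1-Sys3-Sys5 and Sys2-Sys3-Sys4. The function names are hypothetical. One function applies (1.2) with Sys3 as the conditioning component; the other sums the probabilities of all $2^5$ component states in which at least one path is open.

```python
from itertools import product


def bridge_by_conditioning(p1, p2, p3, p4, p5):
    """Apply (1.2) with Sys3 as the conditioning component."""
    works_given_3 = (p1 + p2 - p1 * p2) * (p4 + p5 - p4 * p5)
    works_given_not_3 = p1 * p4 + p2 * p5 - p1 * p4 * p2 * p5
    return works_given_3 * p3 + works_given_not_3 * (1 - p3)


def bridge_by_enumeration(p):
    """Sum the probabilities of all component states that leave a path open."""
    total = 0.0
    for state in product([0, 1], repeat=5):
        s1, s2, s3, s4, s5 = state
        # Assumed path set of the bridge structure.
        path = (s1 and s4) or (s2 and s5) or (s1 and s3 and s5) or (s2 and s3 and s4)
        if path:
            weight = 1.0
            for up, pi in zip(state, p):
                weight *= pi if up else 1 - pi
            total += weight
    return total


p = [0.9, 0.8, 0.7, 0.6, 0.5]  # illustrative component reliabilities
print(bridge_by_conditioning(*p), bridge_by_enumeration(p))
```

Under these assumptions the two computations agree, which is the numerical counterpart of the statement that the system works if and only if a path exists from the input to the output.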


A remark is in order here. 


Remark 1 It is easy to manipulate Venn diagrams. They help us to state our
intuition, but they cannot be considered as tools for the proof of theorems
about sets [18]. The following example (see Example 1.2.9 in [18]) shows that Venn
diagrams must be used with care. Indeed, if the cases (a) and (b) (see Figure 1.10)
are respectively used to decide whether $A \cup B$ is a subset of $B$, we will obtain two
different results.


Figure 1.10 Venn diagram: $A \cup B$ is a subset of $B$?


Now we shall be concerned with indicators of events. Indicators of events play
an important role in stochastic processes. They permit operations on events to
be transformed into algebraic operations. The indicator is a binary random variable.
It takes the value one when the event "indicated" occurs and zero otherwise.
Let us denote the indicator by $1_A(\omega)$:


\[ 1_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A. \end{cases} \tag{1.3} \]


From (1.3), it readily follows that
\[ 1_{\Omega} = 1 \quad \text{and} \quad 1_{\emptyset} = 0. \]
The indicator function exhibits the following properties:


\[
\begin{array}{l}
\text{if } A \subset B, \text{ then } 1_A \le 1_B, \text{ and conversely} \\
\text{if } A = B, \text{ then } 1_A = 1_B, \text{ and conversely} \\
1_{A^c} = 1 - 1_A, \quad 1_{AB} = 1_A 1_B, \quad 1_{A+B} = 1_A + 1_B \\
1_{A \cup B} = 1_A + 1_{A^c B} = 1_A + 1_B - 1_{AB} \\
1_{\bigcap_{k=1}^{n} A_k} = \prod_{k=1}^{n} 1_{A_k}, \quad 1_{\sum_{k=1}^{n} A_k} = \sum_{k=1}^{n} 1_{A_k} \\
1_{\bigcup_{k=1}^{n} A_k} = 1_{A_1} + (1 - 1_{A_1}) 1_{A_2} + \cdots + (1 - 1_{A_1}) \cdots (1 - 1_{A_{n-1}}) 1_{A_n},
\end{array}
\]


where $1_{A^c}$ denotes the indicator of the contrary event (not $A$, $\bar{A}$), $AB$ denotes that both the event $A$
and the event $B$ occur, etc.
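

A few of these identities can be verified pointwise on a small example. The following sketch is illustrative only: the sample space of ten points, the events `A` and `B`, and the helper `ind` are hypothetical choices. It shows how set operations become arithmetic on indicators.

```python
omega = range(10)          # hypothetical finite sample space
A = {0, 1, 2, 3}
B = {2, 3, 4, 5, 6}


def ind(event):
    """Indicator of `event` as a function of the outcome w."""
    return lambda w: 1 if w in event else 0


one_A, one_B = ind(A), ind(B)
complement_A = set(omega) - A

checks = {
    # 1_{A^c} = 1 - 1_A
    "complement": all(ind(complement_A)(w) == 1 - one_A(w) for w in omega),
    # 1_{AB} = 1_A 1_B
    "intersection": all(ind(A & B)(w) == one_A(w) * one_B(w) for w in omega),
    # 1_{A u B} = 1_A + 1_B - 1_{AB}
    "union": all(
        ind(A | B)(w) == one_A(w) + one_B(w) - one_A(w) * one_B(w) for w in omega
    ),
}
print(checks)
```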


Stochastic processes are defined in what follows. 


1.2.2. Random Variables 


Before introducing the notions of events, set of elementary events and sigma 
algebra, let us consider the “goose” game. This very simple game consists of 
two stages: 


1. to throw a die;
2. to move the pawn of the goose the number of squares equal to the outcome of
the die.

The number of different outcomes is equal to six (1, 2, 3, 4, 5, or 6). The outcome
represents an event that will be denoted by $\omega \in \{1, 2, 3, 4, 5, 6\}$.


If we toss a coin, the number of outcomes (events) is equal to two. We can associate
a number with each outcome, e.g., "0" for tails and "1" for heads, i.e., $\omega \in \{0, 1\}$.


Definition 1 (Event) An event is the occurrence or nonoccurrence of 
a phenomenon. 


The set of elementary events $\omega$ (samples or possible outcomes of an experiment)
will be denoted by $\Omega = \{\omega\}$. For the previous examples, $\Omega = \{1, 2, 3, 4, 5, 6\}$ and
$\Omega = \{0, 1\}$, respectively.


Observe that for the "goose" game, the position of each pawn depends on the
history of the game. The history will be designated by what is called a sigma-algebra 
associated with this history. Below, we will present the general definition. 


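Definition 2 (Sigma-algebra) The system $\mathcal{F}$ of subsets of $\Omega$ is said to be the
$\sigma$-algebra associated with $\Omega$ if the following properties are fulfilled:


1. $\Omega \in \mathcal{F}$;
2. for any sets $A_n \in \mathcal{F}$ ($n = 1, 2, \ldots$) the countable union of elements in $\mathcal{F}$
belongs to the $\sigma$-algebra $\mathcal{F}$, as well as the intersection of elements in $\mathcal{F}$:


\[ \bigcup_{n=1}^{\infty} A_n \in \mathcal{F}, \quad \bigcap_{n=1}^{\infty} A_n \in \mathcal{F}; \]


3. for any set $A \in \mathcal{F}$, its complement belongs to the $\sigma$-algebra:


\[ \bar{A} := \{\omega \in \Omega \mid \omega \notin A\} \in \mathcal{F}. \]


In other words, the $\sigma$-algebra is a collection of subsets of the set $\Omega$ of all possible
outcomes of an experiment, including the empty set $\emptyset$.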
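

On a finite set $\Omega$, countable unions reduce to finite unions, so the smallest $\sigma$-algebra containing a given family of subsets can be computed by repeated closure. The sketch below is an illustration, not the book's construction: the function name `generated_sigma_algebra` and the generating sets are assumptions. It closes the family under complement and union, and closure under intersection then follows from the De Morgan theorem.

```python
from itertools import combinations


def generated_sigma_algebra(omega, generators):
    """Close `generators` under complement and union on a finite set `omega`."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            if omega - a not in family:
                family.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in family:
                family.add(a | b)
                changed = True
    return family


die = {1, 2, 3, 4, 5, 6}
F = generated_sigma_algebra(die, [{1, 2}, {2, 3}])
print(len(F))
```

With the generators $\{1, 2\}$ and $\{2, 3\}$ the outcomes 4, 5 and 6 can never be separated from one another, so the resulting family is built from the four blocks $\{1\}$, $\{2\}$, $\{3\}$ and $\{4, 5, 6\}$.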


Consider a game that consists of tossing a coin. If the outcome is tails, the player 
wins 20 cents, and if the outcome is heads, he loses 15 cents. It is clear that
the fortune of the player depends on the history of the game, i.e., the $\sigma$-algebra.
The $\sigma$-algebra is a family of events associated with a random experiment.


Let us consider a pack of 52 cards. The probability of drawing a given card is equal
to $\frac{1}{52}$, and the set of events $\Omega$ consists of 52 elements. The $\sigma$-algebra consists of
all possible subsets of cards, including the global set (52 cards) of cards.


Let us consider another example, the case when $\Omega$ is a subset $X$ of the real axis
$R^1$, i.e.,


\[ \Omega = X \subset R^1, \]

and define the set $A := [a, b)$ as the semi-open interval $[a, b) \subset R^1$. Then the $\sigma$-algebra
$\mathcal{B}(X)$ constructed from all possible intervals $[a, b)$ of the real axis $R^1$ is
called the Borel $\sigma$-algebra generated by all intervals belonging to the subset $X$.
Borel spaces include almost all useful probability spaces.


Simply put, the $\sigma$-algebra represents a collection of any possible events.


Let us consider the following discrete observation of a given process output
\[ y(n) = a e(n) + e(n - 1), \]


where $y(n)$ and $T$ represent the process output at time $nT$ and the sampling period,
respectively. $\{e(n)\}$ is a sequence of independent random variables with zero mean
and finite variance $\sigma^2$. At time $t = nT$, the future values of the output $y$ cannot be
calculated. Based on the observations made up to the time $nT$, the future values of
$y$ can be predicted. It is evident that we cannot use a crystal ball for these purposes.
The predictions of the future values of the output will be calculated on the basis of
the available observations. In other words, the predictions are conditioned by the
$\sigma$-algebra defined by $\mathcal{F}_n = \sigma(y(0), \ldots, y(n - 2), y(n - 1), y(n))$, which represents
the available information or the history of the output. Indeed, the solution of the
prediction problem consists of splitting the expression of the process output into
two terms: one term depending on the $\sigma$-algebra (the available information), and the
other on the future values of the random sequence $\{e(n)\}$, which is not predictable
from past data.


The probability measure¹ (i.e., a function mapping $\mathcal{F} \to R^1$) that assigns a
probability to each set in the field $\mathcal{F}$ will be denoted by Pr. This function satisfies
the following conditions:


\[ \Pr(A) \ge 0 \ \text{ for every } A \in \mathcal{F}, \quad \Pr(\Omega) = 1, \]


\[ \Pr\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} \Pr(A_k) \]

for all mutually disjoint sets $A_1, A_2, \ldots$ in $\mathcal{F}$, i.e.,
\[ A_i \cap A_j = \emptyset \quad (i \ne j), \]

where $\emptyset$ represents the empty set.


¹ Let $X$ be a general set and $\mathcal{B}(X)$ a countably generated $\sigma$-field on $X$. A measure $\mu$ on
the space $(X, \mathcal{B}(X))$ is a function from $\mathcal{B}(X)$ to $(-\infty, +\infty]$, such that


\[ \mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i), \]


where
\[ A_i \in \mathcal{B}(X), \ i = 1, 2, \ldots \quad \text{and} \quad A_i \cap A_j = \emptyset \ \text{ if } i \ne j. \]


The probability is a positive normalized measure ($\mu(A) \ge 0$ for any $A$) on $\mathcal{B}(X)$.


Let us consider again the "goose" game. It is clear that if we throw a die $6n$ times
($n$ is very large), we will obtain approximately $n$ times 1, for example. In other words, the frequency
of occurrence of 1 is equal to


\[ \frac{n}{6n} = \frac{1}{6}. \]


It is natural to associate the probability $\frac{1}{6}$ with the outcome of each face of the die.
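

This frequency interpretation is easy to observe in simulation. The sketch below is an illustration only: the function name `frequency_of_one`, the number of throws and the seed are arbitrary assumptions, and the pseudo-random generator of the Python standard library stands in for a physical die.

```python
import random


def frequency_of_one(throws, seed=0):
    """Relative frequency of the face 1 in `throws` simulated throws of a fair die."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(throws) if rng.randint(1, 6) == 1)
    return hits / throws


freq = frequency_of_one(60_000)
print(freq, 1 / 6)
```

The simulated frequency should lie close to $1/6$, and the deviation typically shrinks as the number of throws grows.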


Definition 3 (Probability space) The triple $(\Omega, \mathcal{F}, \Pr)$ is called a probability
space.


A random variable is a real function defined over a probability space, assuming real 
values (see Figure 1.11). In other words, a random variable is a quantity (such as 
the fortune of the player) that is measured in connection with a random experiment 
(e.g., when tails come up in a coin-tossing game). The position of a given pawn in 
the scene of the “goose” game is a random variable. A Borel function of a random 
variable is also a random variable. A function f which maps points of R, the real 
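line, into R is said to be a Borel function if the inverse image
$f^{-1}(B) = \{x : f(x) \in B\}$ of every Borel set $B$ of the real line is again a Borel set.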


Figure 1.11 Random variable 
Definition 4 (Measurable random variable) We say that $X(\omega)$ is measurable
with respect to a sigma field $\mathcal{F}$ of subsets of $\Omega$, or more briefly $\mathcal{F}$-measurable, if


\[ \{\omega : X(\omega) \le x\} \in \mathcal{F} \quad \text{for all real } x. \]


1.2.3. Stochastic Processes


Next we shall introduce the definition of a stochastic process. 


Definition 5 (Stochastic process) A stochastic process $\{X_t, t \in T\}$, $T \subseteq R^1$,
$X_t \in R^n$, is a family of random variables indexed by the parameter $t$ and defined
on a common probability space $(\Omega, \mathcal{F}, \Pr)$.


For each fixed $t$, $X_t(\omega)$, $\omega \in \Omega$, is a random variable.


For each fixed $\omega$, $X_t(\omega)$, $t \in T$, is called a sample function or realization of the process.


The stochastic process is said to be continuous if $T$ is a continuous subset of $R^1$ and
is said to be discrete if $T$ is a finite or countable set from $R^1$, i.e., $\{t_1, t_2, \ldots, t_n, \ldots\}$.
In this chapter we shall be concerned with discrete stochastic processes. We will
think of $\{t_1, t_2, \ldots, t_n, \ldots\}$ as being the points in time at which observations of the
process are available.


Let us consider the following elementary example of a stochastic process: 


Example 3 (The Pólya urn scheme) Let an urn contain $b$ black and $r$ red
balls. Let $T_0 = b/(b + r)$. At each drawing a ball is drawn at random, its color
is noted and $a$ balls of that color are added to the urn. Let $b_n$ be the number of
black balls and $r_n$ be the number of red balls after the nth drawing. Let $T_n$ be the
proportion of black balls after the nth drawing. $T_n$ is a stochastic process. There
are two events, $\omega \in \Omega = \{\text{red ball is drawn}, \text{black ball is drawn}\}$.
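

One realization of the proportion process can be generated as follows. This is a minimal sketch under an assumption the example does not state explicitly, namely that the drawn ball is returned to the urn together with the $a$ added balls. The function name `polya_urn` and the parameter values are hypothetical.

```python
import random


def polya_urn(b, r, a, draws, seed=0):
    """Return the proportions of black balls T_0, T_1, ..., T_draws."""
    rng = random.Random(seed)
    black, red = b, r
    proportions = [black / (black + red)]
    for _ in range(draws):
        # A black ball is drawn with probability equal to the current proportion.
        if rng.random() < black / (black + red):
            black += a   # drawn ball returned, plus a extra black balls (assumed)
        else:
            red += a     # drawn ball returned, plus a extra red balls (assumed)
        proportions.append(black / (black + red))
    return proportions


path = polya_urn(b=2, r=3, a=1, draws=1000)
print(path[0], path[-1])
```

Different seeds give different sample functions of the same process, each starting from $T_0 = b/(b + r)$.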


Random walks are used as models for many physical phenomena, like diffusion, 
conformation of polymers, Brownian motion, electron transport through metals, 
round-off errors on computers, etc. An example of a random walk is presented in 
what follows. 


Example 4 (Random walk) Suppose that at time instants 0,1,2,... we toss 
a coin. If heads (h) comes up, we take a step $+\Delta x$ forward, and if tails (t)
comes up, we take a step $-\Delta x$ backward ($\Delta x > 0$). Assume that we execute
the steps instantaneously. Let $\Omega = \{h, t\}$. Let $\Pr(h) = p$, $\Pr(t) = q = 1 - p$
($p = q = \frac{1}{2}$ for a fair coin).
For each time instant, define the random variable $w_n$:


\[ w_n(\omega) = \begin{cases} +\Delta x & \omega = h \\ -\Delta x & \omega = t, \end{cases} \]

so that
\[ \Pr[w_n(\omega) = +\Delta x] = p, \quad \Pr[w_n(\omega) = -\Delta x] = q. \]


Let $x_n$ denote our position at instant $n$, but before the execution of a step at that
instant. Assuming we start at the origin, $x_0 = 0$, our position at $n$ is given by

\[ x_n = \sum_{i=0}^{n-1} w_i. \tag{1.4} \]

One realization of this process is given in Figure 1.12. This process is called a
random walk. We are concerned with a sequence of the form:

\[ \omega = hthhhtt \ldots \]
Notice that equation (1.4) is the solution of the random difference equation

\[ x_n = x_{n-1} + w_{n-1}. \tag{1.5} \]
Notice also that a random walk is the sum of the current and past observations
of a white noise process that corresponds to an orthonormal sequence of random
variables $\{e(n)\}$ such that


\[ E\{e(n)\} = 0 \quad \text{and} \quad E\{e(i) e(j)\} = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \ne j. \end{cases} \]


Figure 1.12 Random walk 
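

The realization associated with the sequence hthhhtt can be computed directly from (1.4). The sketch below is illustrative: the function name `random_walk` and the step size $\Delta x = 1$ are assumptions.

```python
from itertools import accumulate


def random_walk(tosses, dx=1.0):
    """Positions x_0, x_1, ..., x_n for a string of tosses ('h' or 't')."""
    steps = [dx if t == "h" else -dx for t in tosses]
    # x_0 = 0 and x_n is the sum of the first n steps, as in (1.4).
    return [0.0] + list(accumulate(steps))


positions = random_walk("hthhhtt")
print(positions)
```

Each position is obtained from the previous one by adding the latest step, which is the recursion (1.5).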
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The Poisson process, which is used as a model for many phenomena such as arrivals 
of calls, queueing processes, medicine, ecology, teletraffic, etc., will be defined in 
what follows. 


Example 5 (Poisson process) Let us consider the number $N(t, t + \Delta t)$ of events
(phone calls) occurring randomly in the interval of time $(t, t + \Delta t)$. Next assume that

\[
\begin{aligned}
\Pr(N(t, t + \Delta t) = 0) &= 1 - \lambda \Delta t + o(\Delta t) \\
\Pr(N(t, t + \Delta t) = 1) &= \lambda \Delta t + o(\Delta t) \\
\Pr(N(t, t + \Delta t) > 1) &= o(\Delta t),
\end{aligned}
\tag{1.6}
\]

where $o(\Delta t)$ (see the end of this chapter) represents a function tending to zero
more quickly than $\Delta t$, and $\lambda$ is the expected number of events per unit time. A
stochastic process fulfilling (1.6) is said to be a Poisson process of rate $\lambda$, whose
dimension is [time]$^{-1}$. Intuitively, the number of events (phone calls) occurring
in the interval $(t, t + \Delta t)$ is independent of what happens at or before $t$. The Poisson
process exhibits the Markov property, i.e., its future behavior depends only on
the current state of the process and not on its history.
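Conditions (1.6) suggest a simple simulation: cut the horizon into short slots of length $\Delta t$ and place one event in each slot with probability $\lambda \Delta t$, independently of the other slots. The sketch below is only an approximation of a Poisson process under that discretization; the function name and the numerical values are illustrative assumptions.

```python
import random

def count_events(rate, horizon, dt, rng):
    """Count events on [0, horizon] using independent slots of length dt."""
    n_slots = int(round(horizon / dt))
    # Each slot holds one event with probability rate * dt (at most one per slot).
    return sum(1 for _ in range(n_slots) if rng.random() < rate * dt)

rng = random.Random(1)
rate, horizon, dt = 2.0, 5.0, 0.01
counts = [count_events(rate, horizon, dt, rng) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / len(counts)
print(mean_count, var_count)   # both should be close to rate * horizon
```

For a Poisson process the number of events on an interval of length $T$ has mean and variance both equal to $\lambda T$, which the two sample statistics approximate.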


A stochastic process can be stationary or nonstationary.

Definition 6 (Stationary process) A stochastic process $x(t)$ is said to be
stationary (in the strict sense) if its statistics (distribution functions for each fixed $t$)
are not affected by a shift in time, i.e., if the two processes

\[ x(t) \quad \text{and} \quad x(t + \varepsilon) \]

have the same statistics for any $\varepsilon$. We say that the processes $x(t)$ and $y(t)$ are
jointly stationary if the joint statistics of

\[ x(t) \quad \text{and} \quad y(t) \]

are the same as the joint statistics of

\[ x(t + \varepsilon) \quad \text{and} \quad y(t + \varepsilon) \]

for any $\varepsilon$.


A well-known property of stationary time sequences is that they may be represented 
by a linear filter model driven by white noise. 


Remark 2 Based on a polynomial identity (the theorem of polynomial division), it has
been shown that an autoregressive integrated moving average (ARIMA) process
can be decomposed into the sum of a stationary stochastic process and a random
walk. This decomposition is known as the Beveridge-Nelson decomposition [7], and
is mainly used in economics².


We shall now deal with a property which plays an important role in practical
applications, namely ergodicity. This property is connected with what is called the law
of large numbers³. Ergodicity deals with the problem of determining the statistics
(mean, etc.) of a stochastic process from a single realization (sample function).


² A polynomial

\[ A(q^{-1}) = a_0 + a_1 q^{-1} + \cdots + a_k q^{-k} \]

can be decomposed in the following form

\[ A(q^{-1}) = A(1) + \Delta(q^{-1}) F(q^{-1}), \]

where $q^{-1}$ represents the backward shift operator and $\Delta(q^{-1}) = 1 - q^{-1}$ (in economic
publications $q^{-1}$ is usually denoted by $z$ or $L$). This decomposition is obtained by dividing
$A(q^{-1})$ by $\Delta(q^{-1})$. It represents a particular case of the following one


A(q7) = AQ)97" + ATOH) 
for n — 1. This latter decomposition is obtained by considering the following identity: 
AM) = AM + ATDA e 97! c gU) 
and substituting it into the first decomposition. 
Let us now consider an ARIMA process

\[ \Delta(q^{-1}) y(t) = \frac{C(q^{-1})}{A(q^{-1})} e(t) = G(q^{-1}) e(t), \]

where $G(q^{-1})$ represents the power series expansion of the rational function
$C(q^{-1})/A(q^{-1})$. In the case where the coefficients of this expansion form an
absolutely summable sequence, $G(q^{-1})$ can be decomposed according to the decomposition
introduced above, which leads to

\[
\begin{aligned}
\Delta(q^{-1}) y(t) &= \Delta(q^{-1}) v(t) + \Delta(q^{-1}) w(t) \\
\Delta(q^{-1}) v(t) &= G(1) e(t), \qquad w(t) = F(q^{-1}) e(t),
\end{aligned}
\]

where $v(t)$ and $w(t)$ represent a random walk and a stationary stochastic process,
respectively. The proof of the Beveridge-Nelson decomposition can be found in [56].
³ The law of large numbers is introduced in Appendix A.

Definition 7 (Ergodic process) $x(t)$ is said to be ergodic in the most general
form (with probability 1) if all its statistics can be determined from a single function
$x(t, \omega)$ of the process.


In other words, the ergodicity property permits us, among other things, to calculate
the expected (mean) value of a given stochastic process by averaging a single
realization over time.


The correspondence between all the possible values of a given random variable and 
their associated probabilities, namely the probability distribution, is presented in 
the next subsection. 


1.2.4. Probability Distribution and Probability Density 


This section introduces in a simple way the skeletal structure of the basic math- 
ematical tools related to probability distribution and probability density functions. 
They play an important role in many areas (control, optimization, image signal 
processing, reliability, chemical engineering, mineral industry, etc.). 


Probability Distribution 

The probability distribution gives the relationship (correspondence) between all
possible values (realizations) of a given random variable and their associated
probabilities. A table is the simplest representation of this
correspondence [65]. Let $x_i$ ($i = 1, \ldots, K$) represent the possible values of the
random variable $X$, and $p_i$ ($i = 1, \ldots, K$) the corresponding probabilities. We can
derive the following table:


x_i | x_1   x_2   ...   x_K
p_i | p_1   p_2   ...   p_K


When the value of $K$ is relatively large, this representation is not useful in practice.
It is better from an engineering point of view to have a graphical representation
as shown in Figure 1.13. In this figure the values $x_i$ ($i = 1, \ldots, K$) are reported
on the abscissae, and their associated probabilities $p_i$ ($i = 1, \ldots, K$) are reported
on the ordinates. The extremities of the vertical lines representing the probabilities
have been connected by segments. Hence, we obtain a polygonal representation.


Observe that K can be large, or tend to infinity; and a continuous random variable 
has an infinity of values in a given interval. We shall now be concerned with 
the (cumulative) probability distribution of a continuous variable X which will be 
denoted by F (x), and defined as follows: 


F(x) = Pr(X < x). 


The distribution function contains all the information which is relevant to
probability theory.

Figure 1.13 Graphical representation


Example 6 Let the probability distribution of a random variable $X$ be

\[
F_X(x) =
\begin{cases}
0 & \text{for } x < 0 \\
x & \text{for } 0 \le x \le 1 \\
1 & \text{for } x > 1.
\end{cases}
\]

Find the probability distribution of the random variable $Z = \sqrt{X}$.

Solution We have to consider only the case when $x \in [0, 1]$. It follows that:

\[ F_Z(x) = \Pr(Z < x) = \Pr(\sqrt{X} < x) = \Pr(X < x^2) = F_X(x^2) = x^2. \]

The probability distribution has the following properties:


1. $F(x)$ is a nondecreasing function.
2. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
3. $F(x)$ is continuous at least from the left.


These properties are evident. In fact, if $x$ increases, $\Pr(X < x)$ cannot decrease,
and the probability of the event $X < +\infty$ ($X < -\infty$) is equal to 1 (0).


Example 7 (see Lemma 4.1.1 in [27]) Let the probability distribution $F(t)$ of
the random variable $Y$ be continuous. Show that the random variable $Z = F(Y)$
is uniformly distributed over the interval [0, 1].

Solution In view of the definition of the probability distribution, $F(t) = \Pr(Y < t)$,
we have

\[ \Pr(Z < x) = \Pr(Y < F^{-1}(x)) = F(F^{-1}(x)) = x, \]

where $F^{-1}(x)$ represents the inverse of $F(t)$, for $0 \le x \le 1$.
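The result of Example 7 can be checked empirically. The sketch below is an illustration under assumptions: the exponential distribution, the rate 1.5, the seed, and the helper names are choices made here, not taken from the book. It draws samples of $Y$, maps them through $F$, and compares the empirical distribution of $Z = F(Y)$ with that of a uniform variable.

```python
import math
import random

def exponential_cdf(y, lam):
    """F(y) for an exponential distribution of rate lam."""
    return 1.0 - math.exp(-lam * y) if y > 0 else 0.0

def empirical_cdf(samples, x):
    """Fraction of samples strictly below x, an estimate of Pr(Z < x)."""
    return sum(1 for s in samples if s < x) / len(samples)

rng = random.Random(0)
lam = 1.5
samples_y = [rng.expovariate(lam) for _ in range(20000)]
samples_z = [exponential_cdf(y, lam) for y in samples_y]   # Z = F(Y)

for x in (0.25, 0.5, 0.75):
    print(x, empirical_cdf(samples_z, x))   # should be close to x
```

Read in the other direction, the same identity gives a way to generate samples of $Y$ from uniform random numbers by applying $F^{-1}$.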


Probability Density 

Now, we are interested in the probability that a continuous random variable $X$
belongs to the interval $[x, x + \Delta x]$. It is clear that this probability is equal to the
variation of the probability distribution in this interval, i.e.,

\[ \Pr(x \le X < x + \Delta x) = F(x + \Delta x) - F(x). \]

Let us assume that $F(x)$ is continuous and differentiable, and consider the average
of this probability per unit length. If $\Delta x \to 0$, we obtain

\[ \lim_{\Delta x \to 0} \frac{F(x + \Delta x) - F(x)}{\Delta x} = \frac{dF(x)}{dx} = f(x), \tag{1.7} \]

where $f(x)$ represents the probability density function (pdf). In summary, every
real-valued function which is nonnegative ($\Pr(\cdot) \ge 0$), integrable over the whole
real axis, and satisfies (1.7) is the probability density of a random variable $X$.


The pdf is useful for calculating the probability that a random variable $X$ belongs
to a given interval, say $(a, b)$. This probability is given by

\[ \Pr(a < X < b) = \int_a^b f(x)\,dx. \]

In other words, this probability is equal to the area between the curve $f(x)$ and the
x-axis over the interval $(a, b)$.
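As a numerical illustration of the area interpretation, the sketch below integrates a density over an interval with the trapezoidal rule and compares the result with the difference of the distribution function at the end points. The exponential density, the rate, the interval, and the function names are assumptions chosen for the example.

```python
import math

LAM = 2.0  # rate of the illustrative exponential density

def pdf(x):
    return LAM * math.exp(-LAM * x) if x >= 0 else 0.0

def cdf(x):
    return 1.0 - math.exp(-LAM * x) if x >= 0 else 0.0

def trapezoid(f, a, b, n=10000):
    """Approximate the integral of f over [a, b] with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

a, b = 0.5, 1.5
prob_area = trapezoid(pdf, a, b)      # area under the density
prob_cdf = cdf(b) - cdf(a)            # F(b) - F(a)
print(prob_area, prob_cdf)
```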


In many situations, the probability density function is a priori unknown. One
possible way to obtain a mathematical expression of $f(x)$ is to consider a model of
the form

\[ f(x) = \sum_{i=1}^{\infty} \alpha_i \varphi_i(x), \]

where $\varphi_i(x)$ ($i = 1, 2, \ldots$) are a priori known functions. A truncated series
of $m$ terms

\[ \hat{f}(x) = \sum_{i=1}^{m} \alpha_i \varphi_i(x) \]

can then be used as an approximation of the true function.

Remark 3 (Modeling) Statistical analysis can be used to fit unknown model
parameters, and to evaluate the uncertainty related to the fitted model, as well as
to compare several candidate functions $\varphi_i(x)$. One common approach is to use
optimization theory to carry out least-squares estimates for the parameters $\alpha_i$.


Two Random Variables 
Consider a system consisting of two random variables X and Y. As for single 
random variables, we shall define the characteristics of the system (X, Y). 


• Probability distribution:

\[ F(x, y) = \Pr((X < x)(Y < y)). \]

• Probability density: expression (1.7) is extended as follows:

\[
\Pr((X, Y) \in A) = F(x + \Delta x, y + \Delta y) - F(x + \Delta x, y)
- F(x, y + \Delta y) + F(x, y), \tag{1.8}
\]

where $A$ represents a rectangle of dimensions $\Delta x$ and $\Delta y$. Expression (1.8)
leads to

\[
\lim_{\Delta x \to 0,\, \Delta y \to 0} \frac{\Pr((X, Y) \in A)}{\Delta x \Delta y}
= \lim_{\Delta x \to 0,\, \Delta y \to 0} \frac{1}{\Delta x \Delta y}
\bigl( F(x + \Delta x, y + \Delta y) - F(x + \Delta x, y) - F(x, y + \Delta y) + F(x, y) \bigr).
\]

If the function $F(x, y)$ is continuous and differentiable, we get

\[ f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\, \partial y}. \]

• Moment of order $s, k$:

\[ m_{s,k} = E\{X^s Y^k\}, \]

where $E\{\cdot\}$ denotes the expectation operator (see next section).
• Covariance:

\[ K_{xy} = \mathrm{Cov}(X, Y) = E\{(X - m_x)(Y - m_y)\}, \]

where $m_x$ and $m_y$ are the means of $X$ and $Y$, respectively. The covariance
characterizes the dispersion of random variables and the degree of their dependency.
Notice that for independent random variables the covariance is equal to zero⁴.

Observe that if the dispersion of a given random variable, say $X$, is small
(the random variable is close to its mean), the covariance will be small even if
the random variables $X$ and $Y$ are closely dependent. In order to characterize
only the dependency between random variables, the coefficient of correlation
is introduced.

⁴ The converse is only true for the Gaussian case.

• Coefficient of correlation:

\[ r_{xy} = \frac{K_{xy}}{\sigma_x \sigma_y}, \]

where $\sigma_x^2$ and $\sigma_y^2$ represent the variance of $X$ and $Y$, respectively.
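The difference between covariance and correlation can be illustrated numerically. In the sketch below (an illustration only, with made-up data and helper names, and assuming the usual definition of the correlation coefficient as the covariance divided by the product of the standard deviations), $Y$ is an exact linear function of $X$, but $X$ varies very little: the covariance is tiny while the correlation coefficient equals 1.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(x, y):
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

def correlation(x, y):
    # Covariance normalized by the two standard deviations.
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [0.001 * k for k in range(10)]   # small dispersion around the mean
y = [3.0 * v + 1.0 for v in x]       # Y depends exactly (linearly) on X

cov_xy = covariance(x, y)
r_xy = correlation(x, y)
print(cov_xy, r_xy)
```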


In the subsequent paragraphs, two operators of profound importance are discussed, 
namely the expectation and the conditional mathematical expectation. 


1.2.5. Expectation and Conditional Mathematical Expectation 


Before defining the expectation and conditional mathematical expectation, let us 
consider the behavior of a distillation column where the objective is to obtain 
a distillate with a given concentration, say 90%. The distillate is poured into a 
tank. If a concentration sensor is available, we can record the concentration versus 
time. This concentration will fluctuate in time due to the effect of the variation of 
the ambient temperature and the chemical and physical characteristics of the raw 
material. If we measure the concentration in the tank, however, we observe that the 
concentration is constant, due to the mixing effect, and corresponds to the mean 
value of the recorded concentration (see Figure 1.14). 


The mathematical definition of the expectation is given in what follows. 


Definition 8 (Expectation) The Lebesgue integral (see [2])

\[ E\{\xi\} := \int_{x \in X} x \, dF(x) \]

is said to be the mathematical expectation of a random variable $\xi$ having a
distribution function $F(x)$ given on $X$. This integral is understood in the Lebesgue sense.
If $F(x)$ is differentiable [$f(x) = dF(x)/dx$], then

\[ E\{\xi\} = \int_{x \in X} x \, \frac{dF(x)}{dx} \, dx = \int_{x \in X} x f(x) \, dx. \]

Figure 1.14 Evolution of the concentration (the mean value is labeled $C_{mean}$)


In the discrete case, the expectation corresponds to the sum of the possible values of
the random variable, each weighted by the probability with which it is assumed.


The variance represents another characteristic of a given random variable. It char- 
acterizes the fluctuation of a given random variable around its mean value. In the 
example related to a distillation column, the energy of separation increases when 
the concentration obtained is greater than the desired value and decreases when 
the concentration is less than the desired value. We can easily understand that the 
energy consumption will be intimately related to the fluctuations of the concentra- 
tion around its mean value (desired value). In other words, the energy consumption 
will depend on the variance of the column output, i.e., the concentration of the 
distillate [47]. Tracer tests on process vessels are commonly used for the determin- 
ation of the residence time distribution curve. Statistical parameters that describe 
the curve, such as the mean and the variance can be used to diagnose operating 
problems within the vessel. 


Definition 9 (Variance or second central moment) The variance of a given random
variable $\xi$ is given by

\[ \mathrm{Var}(\xi) := E\{(\xi - E\{\xi\})^2\} = E\{\xi^2\} - (E\{\xi\})^2 = \sigma_\xi^2. \]

The variance of the sum of independent random variables is equal to the sum of
the variances of these random variables. In fact, if $S_n = \sum_{i=1}^{n} \xi_i$, then

\[ \sigma_{S_n}^2 = E\{(S_n - E\{S_n\})^2\} = \sum_{i=1}^{n} \sigma_{\xi_i}^2. \]

This result corresponds to the Bienaymé equality [42].
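The Bienaymé equality can be verified exactly on a small discrete example. The sketch below uses two independent fair dice, which is an illustrative choice and not an example from the book, and computes the variances with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)          # probability of each face of a fair die

def variance(values_probs):
    """Variance of a discrete distribution given as (value, probability) pairs."""
    m = sum(v * q for v, q in values_probs)
    return sum((v - m) ** 2 * q for v, q in values_probs)

var_one = variance([(v, p) for v in faces])
# Independence: each pair of faces has probability p * p.
var_sum = variance([(v1 + v2, p * p) for v1, v2 in product(faces, repeat=2)])
print(var_one, var_sum)     # var_sum equals var_one + var_one
```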


Definition 10 (Centering) Centering corresponds to a change of variable. In other
words, we say that $\xi$ is centered at $c$ if we replace it by

\[ \xi - c. \]

From the previous definition, we deduce that:


1. A random variable is centered at its expectation if, and only if, its expectation
exists and is equal to zero.
2. The variance of a given random variable remains unchanged by centering, i.e.,

\[ \mathrm{Var}(\xi) = E\{(\xi - E\{\xi\})^2\} = E\{(\xi - c - E\{\xi - c\})^2\} = \sigma_\xi^2. \]

Definition 11 Let $\xi$ be a random variable such that the moments

\[ E\{|\xi|^m\}, \qquad m = 1, 2, \ldots, N \]

exist. Such random variables form a space $L_N$ over the probability space $(\Omega, \mathcal{F}, \Pr)$.
In symbols, $\xi \in L_N$. The space $L_2$ plays an important role in the investigation of
probability problems, especially in those related to the sums of independent random
variables [42].


Let us now consider the notion of integral in the framework of probability theory.
First, consider a tank fed with a flow rate $q(t)$. During the time interval $[t_1, t_2]$,
the quantity of water poured in the tank will be denoted by $\int_{t_1}^{t_2} q$. This quantity is
bounded as follows:

\[ m (t_2 - t_1) \le \int_{t_1}^{t_2} q \le M (t_2 - t_1), \tag{1.9} \]

where $m$ and $M$ are the minimum and the maximum flow rates, respectively,
during the interval of time $[t_1, t_2]$. If we consider another interval of time $[t_2, t_3]$,
we can write

\[ \int_{t_1}^{t_2} q + \int_{t_2}^{t_3} q = \int_{t_1}^{t_3} q. \tag{1.10} \]

These two properties (1.9) and (1.10) characterize the integral. The existence of
the integral of a given function needs a constructive definition of the integral in
some sense.


For continuous functions, there exists an identity between the notions of
integrals and primitives⁵. Riemann defined the integrals of certain discontinuous
functions, but not every derivative is integrable in the Riemann sense.
The problem of finding primitive functions is therefore not solved by Riemann integration,
and we desire a definition of the integral which includes the Riemann one as
a particular case.

⁵ A function $F(x)$ is called a root function or a primitive function of $f(x)$ if and only if
$F'(x) = f(x)$. For example, a primitive of $f(x) = x^3$ is $x^4/4$.


Remark 4 (Riemann integral) To define the integral of a continuous function
$f(x)$, $x \in [a, b]$, the interval $[a, b]$ is partitioned into a set of intervals, i.e.,

\[ x_0 = a, x_1, \ldots, x_n = b, \qquad x_0 < x_1 < \cdots < x_n, \]
\[ \Delta_i = [x_{i-1}, x_i], \qquad a_i = f(\xi_i), \qquad \xi_i \text{ is a given point in } \Delta_i. \]

The integral in the Riemann sense is given by

\[ \int_a^b f(x)\,dx = \lim_{\max_i (x_i - x_{i-1}) \to 0} \sum_{i=1}^{n} a_i (x_i - x_{i-1}) \]

if the limit exists.


Instead of partitioning the domain, we could partition the range space [66]. This 
is more complex, but it does expand the functions that one may integrate to more 
than just piece-wise continuous functions. This is the entire difference between 
Riemann integration and Lebesgue (Lebesgue-Stieltjes) integration and the reason 
why the Lebesgue integral is used. 


Example 8 (Counting coins) (see [66]) The Lebesgue integral, in some sense, 
is more natural than the Riemann integral. Consider the difficult task of counting 
a jar full of coins. The jar is big and the coins in it are pennies, nickels, dimes and 
quarters. There exist at least two methods to count the coins. 


1. Pull the coins from the jar, one by one, tallying the total. 

2. Pull the coins from the jar, one by one, grouping them in piles of like kinds. 
Then count the number of pennies, nickels, dimes and quarters, respectively, 
and do the obvious arithmetic. 


Case 1 corresponds to the Riemann integration, and case 2 to the Lebesgue 
integration. 
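The two counting methods can be written out directly. The sketch below is purely illustrative (the jar contents and names are made up): method 1 walks through the coins in the order they come out of the jar, while method 2 first groups the coins by value and then weights each value by the size of its pile.

```python
from collections import Counter

COIN_VALUES = {"penny": 1, "nickel": 5, "dime": 10, "quarter": 25}  # in cents
jar = ["dime", "penny", "quarter", "penny", "nickel", "dime", "quarter", "quarter"]

# Method 1 (Riemann-like): add the value of each coin in the order it is pulled.
total_one_by_one = 0
for coin in jar:
    total_one_by_one += COIN_VALUES[coin]

# Method 2 (Lebesgue-like): group by value, then sum value * size of the pile.
piles = Counter(jar)
total_by_piles = sum(COIN_VALUES[kind] * count for kind, count in piles.items())

print(total_one_by_one, total_by_piles)
```

Both methods give the same total; they differ only in whether the domain (the sequence of draws) or the range (the set of coin values) is partitioned.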


Remark 5 (Lebesgue integral) Let us consider again the function $f(x)$,
$x \in [a, b]$. This function varies in the interval $[m, M]$:

\[ f(x) \in [m, M]. \]

Let us consider the following partition of the interval $[m, M]$:

\[ m = m_0 < m_1 < m_2 < \cdots < m_p = M. \tag{1.11} \]

In very simple words, the integral in the Lebesgue-Stieltjes sense is defined on the
basis of the partition (1.11): roughly speaking, each level $m_k$ is weighted by the
measure of the set of points $x$ for which $m_{k-1} \le f(x) < m_k$.


Remark 6 Notice that even if $f$ is integrable in the Riemann sense in $[a, b]$,
the function

\[ F(x) = \int_a^x f(t)\,dt, \qquad a \le x \le b \]

is not necessarily differentiable. An important theorem of the Lebesgue-Stieltjes
integration theory is that $F'(x)$ exists "almost everywhere" in a certain sense.


Example 9 The integral in the Lebesgue-Stieltjes sense permits, among other
things, one to justify the following formulae:

\[ \lim_{n \to \infty} \int_a^b f_n(x)\,dx = \int_a^b \lim_{n \to \infty} f_n(x)\,dx \]

\[ \sum_{n=1}^{\infty} \int_a^b f_n(x)\,dx = \int_a^b \sum_{n=1}^{\infty} f_n(x)\,dx. \]

Here it is assumed that all the limits of both left- and right-hand sides exist.


Remark 7 (Lebesgue-Stieltjes measure) A salient underpinning of probability
theory is the one-to-one correspondence between distribution functions on
$\mathbb{R}^n$ and the probability measures on the Borel subsets of $\mathbb{R}^n$. Verification of this
correspondence involves the notion of measure extension.


Remark 8 There exist various definitions (not equivalent) of the integrals. The 
difference between them resides in the set of functions for which the integral is 
defined. 


Usually, there exists a dependence (relationship) between random variables. There- 
fore, the next definition deals with the definition of conditional mathematical 
expectation. 


Definition 12 (Conditional expectation) The random variable $E\{\xi \mid \mathcal{F}_0\}$ is
called the conditional mathematical expectation of the random variable $\xi(\omega)$
given on $(\Omega, \mathcal{F}, \Pr)$ with respect to the $\sigma$-algebra $\mathcal{F}_0 \subset \mathcal{F}$ if


1. it is $\mathcal{F}_0$-measurable, i.e.,

\[ \{\omega \mid E\{\xi \mid \mathcal{F}_0\} \le x\} \in \mathcal{F}_0 \quad \forall x \in \mathbb{R}; \]

2. for any set $A \in \mathcal{F}_0$,

\[ \int_{\omega \in A} E\{\xi \mid \mathcal{F}_0\} \Pr(d\omega) = \int_{\omega \in A} \xi(\omega) \Pr(d\omega) \]

(here the equality must be understood in the Lebesgue sense).


The basic properties of the conditional mathematical expectation will be presented 
in the following. 


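Let $\xi = \xi(\omega)$ and $\theta = \theta(\omega)$ be two random variables given on $(\Omega, \mathcal{F}, \Pr)$,
where $\theta$ is $\mathcal{F}_0$-measurable ($\mathcal{F}_0 \subset \mathcal{F}$). Then (see [2]):


1. $E\{\theta \mid \mathcal{F}_0\} = \theta$;
2. $E\{\theta \xi \mid \mathcal{F}_0\} = \theta E\{\xi \mid \mathcal{F}_0\}$;
3. $E\{E\{\xi \mid \mathcal{F}_1\} \mid \mathcal{F}_0\} = E\{\xi \mid \mathcal{F}_0\}$ if $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}$.


Notice that if $\xi$ is selected to be equal to the indicator function of the event
$A \in \mathcal{F}$, i.e.,

\[
\xi(\omega) = \chi(\omega, A) :=
\begin{cases}
1 & \text{if the event } A \text{ has been realized} \\
0 & \text{if not,}
\end{cases}
\]

from the last definition we can define the conditional probability of this event
under fixed $\mathcal{F}_0$ as follows:

\[ \Pr\{A \mid \mathcal{F}_0\} := E\{\chi(\omega, A) \mid \mathcal{F}_0\}. \]
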
Remark 9 (Properties of conditional expectation) Let $\mathcal{F}_n$ denote the information
contained in $X_1, \ldots, X_n$.


1. Conditional mathematical expectation is a linear operator, i.e., if $a$ and $b$ are
constants, then

\[
E\{aY_1 + bY_2 \mid \mathcal{F}_n\} = E\{aY_1 \mid \mathcal{F}_n\} + E\{bY_2 \mid \mathcal{F}_n\}
= aE\{Y_1 \mid \mathcal{F}_n\} + bE\{Y_2 \mid \mathcal{F}_n\}.
\]

2. If $Y$ is already a function of $X_1, \ldots, X_n$, then

\[ E\{Y \mid \mathcal{F}_n\} = Y. \]

3. For any $Y$, if $m < n$, then

\[ E\{E\{Y \mid \mathcal{F}_n\} \mid \mathcal{F}_m\} = E\{Y \mid \mathcal{F}_m\}. \]

4. If $Y$ is independent of $X_1, \ldots, X_n$, then information about $X_1, \ldots, X_n$ should
not be useful in determining $Y$ and

\[ E\{Y \mid \mathcal{F}_n\} = E\{Y\}. \]


Recall that a stochastic process $\{x_n, n \in \mathbb{N}\}$ is a collection (family) of random
variables indexed by a positive integer parameter $n$ and defined on a probability
space.


Example 10 Let $X(t)$ be a random function with mean $m_x(t)$. Determine the
mean of the following random function

\[ Y(t) = \int_0^t X(\tau)\,d\tau. \]

Solution Let us put the integral into a form that is more amenable for the
derivation of the mathematical expectation $m_y(t)$ of the random function $Y(t)$:

\[ Y(t) = \int_0^t X(\tau)\,d\tau = \lim_{\Delta\tau \to 0} \sum_i X(\tau_i) \Delta\tau, \]

then the mean of $Y(t)$ is given by

\[
m_y(t) = E\{Y(t)\} = \lim_{\Delta\tau \to 0} \sum_i E\{X(\tau_i)\} \Delta\tau
= \lim_{\Delta\tau \to 0} \sum_i m_x(\tau_i) \Delta\tau
= \int_0^t m_x(\tau)\,d\tau, \qquad \Delta\tau = \tau_i - \tau_{i-1}.
\]

In this derivation we have assumed that the mathematical expectation of the limit is
equal to the limit of the mathematical expectation. In practice, this interchange is
usually justified. In summary, the mathematical expectation of the integral of a given
random function is equal to the integral of its mathematical expectation.


The next example is related to the approximation of the mean and the variance of a 
random function on the basis of the mean and the variance of the random variable. 


Example 11 Consider a random variable $X$ with mean $m$. Determine approximate
values of the mean and the variance of the random variable $Y = \varphi(X)$.

Solution Let us consider the Taylor expansion of the function $\varphi(x)$ around $m$:

\[ \varphi(x) = \varphi(m) + (x - m)\varphi'(m) + \frac{(x - m)^2}{2}\varphi''(m) + \cdots \]

Taking the expectation and neglecting the third and subsequent terms of the previous
expansion (i.e., the terms of second and higher order)⁶, we get

\[ E[\varphi(X)] \approx \varphi(m). \]

Observe that $\varphi(m)$ is a constant. It follows that:

\[ \mathrm{Var}[\varphi(X)] \approx (\varphi'(m))^2 \mathrm{Var}[X - m] = (\varphi'(m))^2 \mathrm{Var}[X]. \]
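The quality of this first-order approximation can be checked on a case where the exact moments are available. The sketch below is an illustration under assumptions: $X$ takes the two values $m - \sigma$ and $m + \sigma$ with probability 1/2 each (so its mean is $m$ and its variance $\sigma^2$), the function is the exponential, and all names and numerical values are chosen here.

```python
import math

def exact_moments(phi, m, sigma):
    """Exact mean and variance of phi(X) for X = m - sigma or m + sigma, prob. 1/2 each."""
    values = [phi(m - sigma), phi(m + sigma)]
    mean = 0.5 * sum(values)
    var = 0.5 * sum((v - mean) ** 2 for v in values)
    return mean, var

def linearized_moments(phi, dphi, m, sigma):
    """First-order approximations: phi(m) and (phi'(m))^2 * Var[X]."""
    return phi(m), dphi(m) ** 2 * sigma ** 2

m, sigma = 1.0, 0.05
exact_mean, exact_var = exact_moments(math.exp, m, sigma)
approx_mean, approx_var = linearized_moments(math.exp, math.exp, m, sigma)
print(exact_mean, approx_mean)
print(exact_var, approx_var)
```

The approximation is good here because $\sigma$ is small compared with the scale on which the function curves; for larger dispersion the neglected higher-order terms matter.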


In the next example we shall determine the parameters of a linear estimator. 


Example 12 Let us consider again two correlated random variables $\xi$ and $\eta$ such
that $\xi$ is observable. Determine the best estimator (in the mean squares sense)
in the class of linear estimators $L(\xi) = a + b\xi$ of $\eta$, and the minimal value of
the estimation criterion.


Solution The optimal linear estimator $L^*(\xi)$ in the mean squares sense is given by

\[ E\{[\eta - L^*(\xi)]^2\} = \inf_{a, b} E\{[\eta - L(\xi)]^2\}. \]

By setting the derivatives of the criterion $E\{[\eta - (a + b\xi)]^2\}$ with respect to the
unknown parameters $a$ and $b$ equal to zero, we obtain

\[
\begin{aligned}
\frac{\partial E\{[\eta - (a + b\xi)]^2\}}{\partial a} &= -2E\{\eta - a - b\xi\} = 0 \\
\frac{\partial E\{[\eta - (a + b\xi)]^2\}}{\partial b} &= -2E\{(\eta - a - b\xi)\xi\} = 0,
\end{aligned}
\]

which leads to

\[
\begin{aligned}
a^* &= E\{\eta\} - b^* E\{\xi\} \\
E\{\eta\xi\} - E\{\xi\}E\{\eta\} &+ b^* E\{\xi\}^2 - b^* E\{\xi^2\} = 0.
\end{aligned}
\]


⁶ This approximation around the mean $m$ corresponds exactly to the approximation of the
curve associated with the function $Y = \varphi(X)$ by the tangent to this curve at the point for which
the abscissa is equal to $m$.

Taking into account that the variance of $\xi$ and the covariance $\mathrm{Cov}(\xi, \eta)$ are given,
respectively, by

\[
E\{[\xi - E\{\xi\}]^2\} = E\{\xi^2\} - 2E\{\xi\}E\{\xi\} + E\{\xi\}^2
= E\{\xi^2\} - E\{\xi\}^2 \tag{1.12}
\]

\[
\mathrm{Cov}(\xi, \eta) = E\{(\xi - E\{\xi\})(\eta - E\{\eta\})\}
= E\{\xi\eta\} - E\{\xi\}E\{\eta\}, \tag{1.13}
\]

we derive

\[
b^* = \frac{\mathrm{Cov}(\xi, \eta)}{\mathrm{Var}(\xi)}, \qquad
a^* = E\{\eta\} - \frac{\mathrm{Cov}(\xi, \eta)}{\mathrm{Var}(\xi)} E\{\xi\}.
\]
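Consequently, the optimal linear estimator is given by

\[ L^*(\xi) = E\{\eta\} + \frac{\mathrm{Cov}(\xi, \eta)}{\mathrm{Var}(\xi)} (\xi - E\{\xi\}). \]

The minimal value of the estimation criterion is given by

\[
\begin{aligned}
J_{min} = E\{[\eta - L^*(\xi)]^2\}
&= E\left\{\left[\eta - E\{\eta\} - \frac{\mathrm{Cov}(\xi, \eta)}{\mathrm{Var}(\xi)} (\xi - E\{\xi\})\right]^2\right\} \\
&= \mathrm{Var}(\eta) - 2\frac{\mathrm{Cov}^2(\xi, \eta)}{\mathrm{Var}(\xi)} + \frac{\mathrm{Cov}^2(\xi, \eta)}{\mathrm{Var}(\xi)} \\
&= \mathrm{Var}(\eta) - \frac{\mathrm{Cov}^2(\xi, \eta)}{\mathrm{Var}(\xi)} = \mathrm{Var}(\eta)\left[1 - r_{\xi\eta}^2\right].
\end{aligned}
\]

In this derivation we have used expressions (1.12) and (1.13). Observe that the
minimal value $J_{min}$ decreases when the absolute value of the covariance (coefficient
of correlation) increases, and if $\xi$ and $\eta$ are uncorrelated ($r_{\xi\eta} = 0$), the estimator
and the minimal value $J_{min}$ reduce respectively to

\[ L^*(\xi) = E\{\eta\} \quad \text{and} \quad J_{min} = \mathrm{Var}(\eta). \]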
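The formulas of this example can be applied to data by replacing the expectations with sample averages. The sketch below does that on a small made-up data set; the function name, the data, and the use of sample moments in place of the true moments are assumptions made for illustration. It computes $b^* = \mathrm{Cov}(\xi, \eta)/\mathrm{Var}(\xi)$ and $a^* = E\{\eta\} - b^* E\{\xi\}$, and checks that the mean squared residual equals $\mathrm{Var}(\eta) - \mathrm{Cov}^2(\xi, \eta)/\mathrm{Var}(\xi)$.

```python
def mean(v):
    return sum(v) / len(v)

def best_linear_estimator(xi, eta):
    """Return a*, b*, and J_min computed from sample moments."""
    m_xi, m_eta = mean(xi), mean(eta)
    var_xi = mean([(x - m_xi) ** 2 for x in xi])
    var_eta = mean([(e - m_eta) ** 2 for e in eta])
    cov = mean([(x - m_xi) * (e - m_eta) for x, e in zip(xi, eta)])
    b = cov / var_xi
    a = m_eta - b * m_xi
    j_min = var_eta - cov ** 2 / var_xi
    return a, b, j_min

xi = [0.0, 1.0, 2.0, 3.0, 4.0]      # observed variable
eta = [1.1, 2.9, 5.2, 6.8, 9.0]     # variable to be estimated
a, b, j_min = best_linear_estimator(xi, eta)

# Mean squared error actually achieved by L(xi) = a + b * xi on the data.
residual = mean([(e - (a + b * x)) ** 2 for x, e in zip(xi, eta)])
print(a, b, j_min, residual)
```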


The realizations (observations) of stochastic processes are collected in a
database, or recorded graphically. They constitute the available information. One
can ask how to characterize a given stochastic process with respect to this information.
This characterization is defined in what follows:


Definition 13 (Stochastic process adapted to a sequence) A stochastic process
$X_t$ is said to be adapted to the sequence of $\sigma$-algebras $\{\mathcal{F}_t\}$ if $X_t$ is
$\mathcal{F}_t$-measurable for all $t$. In other words, the information contained in $\mathcal{F}_t$ is
sufficient to completely specify $X_t$, i.e.,

\[ E\{X_t \mid \mathcal{F}_t\} = X_t. \]


Properties 

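1. If $X_1, X_2, \ldots$ is a sequence of random variables, $\mathcal{F}_n$ denotes the
"information" contained in $X_1, X_2, \ldots, X_n$. We will write $E\{Y \mid \mathcal{F}_n\}$ for
$E\{Y \mid X_1, X_2, \ldots, X_n\}$.

2. If $\mathcal{F}_{n-1} \subset \mathcal{F}_n \subset \mathcal{F}$ for $n = 1, 2, \ldots$, we say that $\{\mathcal{F}_n\}$ is an increasing sequence
of sub-$\sigma$-algebras.

3. If

\[ A = E\{X \mid \mathcal{F}_n\}, \qquad B = E\{X \mid \mathcal{F}_n\}, \]

then

\[ A = B \quad \text{(a.s.)}, \]

where a.s. stands for "almost surely" (equality with probability one⁷).

4. If $Y$ is independent of $X_1, X_2, \ldots, X_n$, then information about $X_1, X_2, \ldots, X_n$
should not be useful in determining $Y$ and

\[ E\{Y \mid \mathcal{F}_n\} = E\{Y\}. \]

5. If $Y$ is any random variable and $Z$ is a random variable that is measurable with
respect to $X_1, X_2, \ldots, X_n$, then

\[ E\{YZ \mid \mathcal{F}_n\} = Z E\{Y \mid \mathcal{F}_n\}. \]

6. If $\mathcal{F}_{n-1}$, $\mathcal{F}_n$ are sub-$\sigma$-algebras of $\mathcal{F}$ with $\mathcal{F}_{n-1} \subset \mathcal{F}_n$, then

\[
\begin{aligned}
E\{E\{X \mid \mathcal{F}_{n-1}\} \mid \mathcal{F}_n\} &= E\{X \mid \mathcal{F}_{n-1}\} \\
E\{E\{X \mid \mathcal{F}_n\} \mid \mathcal{F}_{n-1}\} &= E\{X \mid \mathcal{F}_{n-1}\}.
\end{aligned}
\]

7. Smoothing properties:

\[
\begin{aligned}
E\{E\{X \mid Y_1, \ldots, Y_{n-1}\} \mid Y_1, \ldots, Y_n\} &= E\{X \mid Y_1, \ldots, Y_{n-1}\} \\
E\{E\{X \mid Y_1, \ldots, Y_n\} \mid Y_1, \ldots, Y_{n-1}\} &= E\{X \mid Y_1, \ldots, Y_{n-1}\}.
\end{aligned}
\]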

⁷ Two random variables $X_1(\omega)$ and $X_2(\omega)$ are equal with probability one (or, almost surely) if

\[ \Pr(X_1(\omega) = X_2(\omega)) = 1. \]

The following examples illustrate how to use the conditional mathematical
expectation. Consider independent, identically distributed real-valued random variables
$X_1, X_2, \ldots$ (not necessarily positive) and define the partial sums

\[ S_n = X_1 + X_2 + \cdots + X_n, \qquad n = 1, 2, \ldots \tag{1.14} \]

\[ S_0 = 0 \quad \text{(by convention)}. \tag{1.15} \]
The process of partial sums (1.14) plays a fundamental role in many diverse areas 
of applications of stochastic processes. Thus, the analysis of embedded recurrent 
events in Markov processes, renewal phenomena, queueing and dam systems, cal- 
culations of risk and ruin probabilities, etc., all revolve about discerning properties 
of certain random functionals connected to the process (1.14) [35]. 


Example 13 (Partial sum) Suppose $X_1, X_2, \ldots$ are independent, identically
distributed (iid) random variables with mean $\mu$. Let $S_n$ denote the partial sum

\[ S_n = X_1 + \cdots + X_n. \]

Let $\mathcal{F}_n$ denote the information in $X_1, \ldots, X_n$. Suppose that $m < n$. Show that

\[ E\{S_n \mid \mathcal{F}_m\} = S_m + (n - m)\mu. \]

Solution The conditional mathematical expectation is a linear operation.
It follows that

\[ E\{S_n \mid \mathcal{F}_m\} = E\{X_1 + \cdots + X_m \mid \mathcal{F}_m\} + E\{X_{m+1} + \cdots + X_n \mid \mathcal{F}_m\}. \]

Since $X_1 + \cdots + X_m$ is measurable with respect to $X_1, \ldots, X_m$, we derive

\[ E\{X_1 + \cdots + X_m \mid \mathcal{F}_m\} = X_1 + \cdots + X_m = S_m. \]

Since $X_{m+1} + \cdots + X_n$ is independent of $X_1, \ldots, X_m$, we derive

\[ E\{X_{m+1} + \cdots + X_n \mid \mathcal{F}_m\} = E\{X_{m+1} + \cdots + X_n\} = (n - m)\mu. \]

Therefore

\[ E\{S_n \mid \mathcal{F}_m\} = S_m + (n - m)\mu. \]
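For a discrete distribution the conditional expectation of the partial sum can be computed exactly by enumerating all possible future values. The sketch below is an illustration only: the three-point distribution, the observed values, and the function name are made up. It conditions on the first $m$ observations and compares the result with $S_m + (n - m)\mu$.

```python
from fractions import Fraction
from itertools import product

# Illustrative distribution of each X_i as (value, probability) pairs.
values = [(-1, Fraction(1, 4)), (0, Fraction(1, 4)), (2, Fraction(1, 2))]
mu = sum(v * q for v, q in values)            # mean of each X_i

def conditional_expectation_of_sum(observed, n):
    """E{S_n | X_1..X_m = observed}, by enumeration of X_{m+1}..X_n."""
    m = len(observed)
    s_m = sum(observed)
    total = Fraction(0)
    for future in product(values, repeat=n - m):
        prob = Fraction(1)
        s_future = 0
        for v, q in future:
            prob *= q                         # independence of the future terms
            s_future += v
        total += prob * (s_m + s_future)
    return total

observed = [2, -1, 0]                         # X_1, X_2, X_3, so m = 3
n = 6
lhs = conditional_expectation_of_sum(observed, n)
rhs = sum(observed) + (n - len(observed)) * mu
print(lhs, rhs)
```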

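Example 14 (Partial sum squared) Suppose $X_1, X_2, \ldots$ and $S_n$ are as in
Example 13. Suppose $\mu = 0$ and $\mathrm{Var}(X_i) = E\{(X_i)^2\} = \sigma^2$. Let $m < n$. Show that

\[ E\{(S_n)^2 \mid \mathcal{F}_m\} = (S_m)^2 + (n - m)\sigma^2. \]

Solution The conditional mathematical expectation is a linear operation.
It follows that

\[
E\{(S_n)^2 \mid \mathcal{F}_m\} = E\{[S_m + (S_n - S_m)]^2 \mid \mathcal{F}_m\}
= E\{(S_m)^2 \mid \mathcal{F}_m\} + 2E\{S_m (S_n - S_m) \mid \mathcal{F}_m\} + E\{[S_n - S_m]^2 \mid \mathcal{F}_m\}.
\]

Since $S_m$ depends only on $X_1, \ldots, X_m$ and $S_n - S_m$ is independent of $X_1, \ldots, X_m$,
we have as in the proof of the previous example

\[
\begin{aligned}
E\{(S_m)^2 \mid \mathcal{F}_m\} &= (S_m)^2 \\
E\{S_m (S_n - S_m) \mid \mathcal{F}_m\} &= S_m E\{S_n - S_m \mid \mathcal{F}_m\} = S_m E\{S_n - S_m\} = 0 \\
E\{[S_n - S_m]^2 \mid \mathcal{F}_m\} &= E\{[S_n - S_m]^2\} = \mathrm{Var}(S_n - S_m) = (n - m)\sigma^2.
\end{aligned}
\]

Therefore

\[ E\{(S_n)^2 \mid \mathcal{F}_m\} = (S_m)^2 + (n - m)\sigma^2. \]
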
Example 15 (Edmundson-Madansky inequality⁸, see [20]) Let $\xi \in [a, b]$
be a random variable, and consider a function $Q(x, \xi)$ that is convex in $\xi$. Show that

\[ E\{Q(x, \xi)\} \le E\{Q(x, \tilde{\xi})\}, \]

where $\tilde{\xi}$ represents a scalar random variable attaining the values

\[
a \text{ with probability } p_a = \frac{b - \bar{\xi}}{b - a}, \qquad
b \text{ with probability } p_b = \frac{\bar{\xi} - a}{b - a},
\]

with

\[ \bar{\xi} = E\{\xi\} = \int_a^b \xi \Pr(d\xi). \]

Solution From the convexity property of the function $Q(x, \xi)$, we derive

\[ (b - a) Q(x, \xi) \le (b - \xi) Q(x, a) + (\xi - a) Q(x, b) \quad \forall \xi \in [a, b]. \tag{1.16} \]

⁸ The Edmundson-Madansky inequality gives an upper bound for expectations of convex
functions.

Taking the mathematical expectation on both sides of (1.16) leads to

\[
\begin{aligned}
E\{Q(x, \xi)\} &\le \frac{1}{b - a} \int_a^b \left[(b - \xi) Q(x, a) + (\xi - a) Q(x, b)\right] \Pr(d\xi) \\
&= \frac{b - \bar{\xi}}{b - a} Q(x, a) + \frac{\bar{\xi} - a}{b - a} Q(x, b) \\
&= E\{Q(x, \tilde{\xi})\},
\end{aligned}
\]

which corresponds to the desired result.
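The bound can be checked exactly on a small discrete example. In the sketch below, all choices are illustrative assumptions: $\xi$ is uniform on nine equally spaced points of $[0, 2]$, and the convex function is $(\xi - 1/2)^2$ with the argument $x$ held fixed and omitted. The two-point variable puts probability $(b - \bar{\xi})/(b - a)$ on $a$ and $(\bar{\xi} - a)/(b - a)$ on $b$, where $\bar{\xi}$ is the mean of $\xi$.

```python
from fractions import Fraction

a, b = Fraction(0), Fraction(2)
support = [Fraction(k, 4) for k in range(9)]      # 0, 1/4, ..., 2
probs = [Fraction(1, 9)] * 9                      # uniform weights

def q(xi):
    """An illustrative function, convex in xi."""
    return (xi - Fraction(1, 2)) ** 2

mean_xi = sum(p * x for p, x in zip(probs, support))
lhs = sum(p * q(x) for p, x in zip(probs, support))   # E{Q(xi)}

p_a = (b - mean_xi) / (b - a)                     # weight on the end point a
p_b = (mean_xi - a) / (b - a)                     # weight on the end point b
rhs = p_a * q(a) + p_b * q(b)                     # E{Q(xi_tilde)}
print(lhs, rhs)
```

The two-point distribution has the same mean as the original one, which is what makes the right-hand side computable from the mean alone.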
1.2.6. Characteristic Function 
Let us now return to probability distributions. The characteristic function⁹ of a
distribution function $F(x)$ is an integral transformation. It is a complex valued
function defined on $\mathbb{R}$ by

\[
\begin{aligned}
\varphi(t) &= \int_{-\infty}^{+\infty} \exp(itx)\,dF(x) \\
&= \int_{-\infty}^{+\infty} \cos(tx)\,dF(x) + i \int_{-\infty}^{+\infty} \sin(tx)\,dF(x) \\
&= E\{\exp(itX)\},
\end{aligned}
\tag{1.17}
\]

where $X$ is a random variable with distribution function $F(x)$.
This integral exists since $|\exp(itx)| = |\cos(tx) + i\sin(tx)| = 1$, bounded for
all $x$. It is used for calculating the moments of different orders for a given random
variable.


In other words, the characteristic function is the Fourier transform of the
distribution function (integral transformation). Integral transformations like the Fourier
and Laplace transforms are commonly used in signal processing, process control, the
integration of ordinary and partial differential equations, and in the
calculation of convolutions of distribution functions. In the framework of reliability, the
Laplace transformation is mainly used for the calculation of the renewal function,
which involves convolution operations.


In what follows, we shall summarize its main properties:


1. $\varphi(0) = 1$.
2. $|\varphi(t)| \le 1$.
3. $\varphi(-t) = \varphi^*(t)$, where $\varphi^*(t)$ represents the complex conjugate of $\varphi(t)$.

4. $\varphi(t)$ is uniformly continuous on the entire real line $]-\infty, +\infty[$.

5. The relation between distribution functions and characteristic functions is one to one.
In other words, the knowledge of the characteristic function is synonymous with the
knowledge of the distribution function.

6. The characteristic function of a random variable $y$

\[ y = ax + b, \qquad a, b = \text{const} \]

is equal to

\[
\begin{aligned}
E\{\exp[it(ax + b)]\} &= \int_{-\infty}^{+\infty} \exp[it(ax + b)]\,dF(x) \\
&= \exp(itb) \int_{-\infty}^{+\infty} \exp(itax)\,dF(x) \\
&= \exp(itb)\varphi(at).
\end{aligned}
\]

7. The characteristic function of the sum $S_n$ of independent random variables $x_i$
is the product of their characteristic functions

\[ \varphi_{S_n}(t) = E\{\exp[it S_n]\} = \prod_{i=1}^{n} \varphi_i(t). \]

8. The moments (when they are finite) of random variables can be determined by
differentiating the characteristic function.

9. Two characteristic functions are equal if and only if they correspond to two
random variables having the same distribution function.


⁹ Lyapunov introduced the characteristic functions in 1900, to prove the central limit
theorem.


For nonnegative integer-valued random variables, the characteristic function is
obtained from the generating function

\[ \psi(s) = \int_0^{+\infty} \exp(st)\,dF(t) \tag{1.18} \]

by replacing the Laplace operator $s$ by $ix$.


From the previous developments the question arises “is it possible to derive the 
distribution function from its characteristic function?” The answer to this question 
is “yes” and corresponds to the Fourier inversion formulas or the so-called inversion 
theorems. 


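Example 16 Let us consider the exponential density distribution

$$f(x) = \begin{cases} \lambda \exp(-\lambda x) & \text{for } x \ge 0 \\ 0 & \text{for } x < 0, \end{cases} \qquad dF(x) = f(x)\, dx.$$

Its characteristic and generating functions are respectively given by

$$\phi(t) = \int_{-\infty}^{+\infty} \exp(itx)\, dF(x) = \lambda \int_{0}^{+\infty} \exp(itx - \lambda x)\, dx = \frac{\lambda}{\lambda - it},$$

$$v(s) = \int_{0}^{+\infty} \exp(sx)\, dF(x) = \frac{\lambda}{\lambda - s} \quad (s < \lambda).$$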
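As a numerical sanity check of the characteristic function of the exponential law, the following Python sketch compares the closed form $\lambda/(\lambda - it)$ with a midpoint-rule approximation of the defining integral. The function names, the truncation point `upper` and the number of steps are illustrative assumptions and do not come from the text.

```python
import cmath


def exp_char_fn_closed(t, lam):
    """Closed form lam / (lam - i t) of the characteristic function."""
    return lam / (lam - 1j * t)


def exp_char_fn_numeric(t, lam, upper=40.0, steps=100_000):
    """Midpoint-rule approximation of the integral of
    exp(i t x) * lam * exp(-lam x) over [0, upper].
    The tail beyond `upper` is neglected (assumes lam * upper >> 1)."""
    h = upper / steps
    total = 0j
    for k in range(steps):
        x = (k + 0.5) * h
        total += cmath.exp((1j * t - lam) * x)
    return lam * total * h


if __name__ == "__main__":
    t, lam = 1.5, 2.0
    print("closed form :", exp_char_fn_closed(t, lam))
    print("numerical   :", exp_char_fn_numeric(t, lam))
```

The closed form also illustrates properties 1 and 2 of the list of properties: it equals 1 at $t = 0$ and its modulus never exceeds 1.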


Remark 10 The exponential distribution is commonly used for modelling
durations such as unemployment spells, etc.


A detailed presentation of the main probability distributions is postponed until 
Chapter 2. 


So-called stable distributions play an important role in probability theory and its
applications. This is basically due to the fact that they appear as limit distributions
for sums of i.i.d. variables. There are various definitions of stable distributions.
It is simple to define the distribution stability in the following way [46]: Consider
i.i.d. random variables $\eta_1, \eta_2, \eta_3, \ldots$ with the same distribution as $\eta$. We say
that the distribution of $\eta$ is stable if for any $n \ge 1$ there exist real numbers $a_n$
and $b_n > 0$ such that

$$\eta \stackrel{d}{=} b_n(\eta_1 + \cdots + \eta_n) + a_n.$$


In the sequel we present two main properties of random sequences. 


1.2.7. Orthogonal and Uncorrelated Sequences 


In this section we shall be concerned with the notions of orthogonal and uncorrel- 
ated sequences. We will present the connection existing between these important 
notions. 


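Definition 14 (Orthogonal and uncorrelated sequences) The basic sequence
$\{X_i, i \ge 1\}$ is said to be orthogonal if

$$E\{(X_i)^2\} < \infty \quad \text{for all } i \ge 1 \qquad (1.19)$$

and

$$E\{X_i X_j\} = 0 \quad \text{for all } i \ne j.$$

The basic sequence is said to be uncorrelated if

$$E\{(X_i)^2\} < \infty \quad \text{for all } i \ge 1 \qquad (1.20)$$

and

$$E\{X_i X_j\} = E\{X_i\} E\{X_j\} \quad \text{for all } i \ne j.$$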




Remark 11 If $E\{X_i\} = 0$ for all $i \ge 1$, it is immediately clear that the $X_i$
are orthogonal if and only if the $X_i$ are uncorrelated; that is, if the $X_i$ have zero
mean, the theory of orthogonal random variables coincides with the theory of
uncorrelated random variables. Centering uncorrelated random variables at their
means preserves the property of their being uncorrelated [63].


Remark 12 (Noise in Identification) In process identification (input-output
representations), the noise $b(t)$ is usually assumed to be generated by a sequence of
independent random variables $e(t)$ in the form

$$b(t) = C(q^{-1})\, e(t),$$

where $\{e(t)\}$ is a sequence of independent random variables with zero mean and
finite variance^10 $\sigma^2$, $q^{-1}$ represents the backward shift operator, and $C(q^{-1})$ is a
polynomial of degree $n$. Even if $\{e(t)\}$ is a sequence of independent random variables,
the terms of $\{b(t)\}$, $t = 1, 2, \ldots$, are correlated. The process $b(t)$ is stationary.
Indeed,

$$b(t) = c_0 e(t) + c_1 e(t-1) + \cdots + c_n e(t-n),$$

$$E\{b(t)\} = 0, \qquad E\{(b(t))^2\} = \sigma^2 \sum_{i=0}^{n} (c_i)^2,$$

and

$$b(t+j) = c_0 e(t+j) + c_1 e(t+j-1) + \cdots + c_n e(t+j-n) \quad \forall j,$$

$$E\{b(t+j)\} = E\{b(t)\} = 0, \qquad E\{(b(t+j))^2\} = E\{(b(t))^2\} = \sigma^2 \sum_{i=0}^{n} (c_i)^2.$$

In the framework of prediction and minimum variance regulation [3, 31, 47], the
term $b(t)$, where the degree of the polynomial $C(q^{-1})$ is equal to $k - 1$, represents
the prediction error for which the variance increases with the delay $k$.
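The correlation between the terms of a moving-average noise $b(t) = c_0 e(t) + \cdots + c_n e(t-n)$ can be made explicit: for a white noise $e(t)$ of variance $\sigma^2$, $E\{b(t)\, b(t+j)\} = \sigma^2 \sum_{i} c_i c_{i+j}$, which does not depend on $t$. The Python sketch below evaluates this formula; the function name and the coefficient values are hypothetical and serve only as an illustration.

```python
def ma_autocovariance(c, sigma2=1.0, max_lag=None):
    """Autocovariance E{b(t) b(t+j)}, j = 0..max_lag, of
    b(t) = c[0] e(t) + c[1] e(t-1) + ... + c[n] e(t-n),
    where e(t) is a zero-mean white noise with variance sigma2."""
    n = len(c) - 1
    if max_lag is None:
        max_lag = n + 1
    gamma = []
    for j in range(max_lag + 1):
        # Only noise samples shared by b(t) and b(t+j) contribute;
        # no sample is shared once j > n, so the sum is then empty.
        shared = range(max(0, n + 1 - j))
        gamma.append(sigma2 * sum(c[i] * c[i + j] for i in shared))
    return gamma


if __name__ == "__main__":
    # Hypothetical coefficients c0, c1, c2 (degree n = 2).
    print(ma_autocovariance([1.0, 0.5, 0.25]))
```

The lag-0 value is the variance of $b(t)$, the values at lags 1 to $n$ are in general nonzero (correlated terms), and the covariance vanishes beyond lag $n$.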


Example 17 Show that a random variable X is independent of itself if and only
if X = const.


Solution Observe that if X is independent of itself, we have

$$E\{X^2\} = E\{X\} E\{X\}.$$

Let us denote by $m$ the mathematical expectation $E\{X\}$ of X, and calculate its
variance

$$E\{(X - m)^2\} = E\{X^2\} - 2mE\{X\} + m^2 = E\{X^2\} - 2E\{X\}E\{X\} + E\{X\}E\{X\} = 0.$$

Finally, we obtain

$$E\{(X - m)^2\} = 0.$$

This equality holds only if $X - m = 0$, i.e., $X = m$ almost everywhere.


10 A white noise.


Remark 13 (Independence and Correlation) Recall that the independence of 
two random variables implies noncorrelation, but the converse is not true. 


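Example 18 Let $\xi_1, \ldots, \xi_n$ be independent random variables and

$$\xi_{\min} = \min\{\xi_1, \ldots, \xi_n\}, \qquad \xi_{\max} = \max\{\xi_1, \ldots, \xi_n\}.$$

Show that

$$\Pr(\xi_{\min} \ge x) = \prod_{i=1}^{n} \Pr(\xi_i \ge x), \qquad \Pr(\xi_{\max} < x) = \prod_{i=1}^{n} \Pr(\xi_i < x).$$

Solution Observe that

$$\xi_{\min} \ge x \iff \xi_i \ge x, \quad i = 1, \ldots, n,$$

$$\xi_{\max} < x \iff \xi_i < x, \quad i = 1, \ldots, n.$$

Consequently, we have

$$\Pr(\xi_{\min} \ge x) = \Pr(\xi_1 \ge x, \ldots, \xi_n \ge x),$$

$$\Pr(\xi_{\max} < x) = \Pr(\xi_1 < x, \ldots, \xi_n < x).$$

In view of the fact that $\xi_1, \ldots, \xi_n$ are independent, it follows that

$$\Pr(\xi_{\min} \ge x) = \Pr(\xi_1 \ge x, \ldots, \xi_n \ge x) = \prod_{i=1}^{n} \Pr(\xi_i \ge x),$$

$$\Pr(\xi_{\max} < x) = \Pr(\xi_1 < x, \ldots, \xi_n < x) = \prod_{i=1}^{n} \Pr(\xi_i < x).$$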
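The product formulas for the minimum and the maximum of independent random variables can be checked by exact enumeration for discrete variables. The sketch below is only an illustration: the function name and the three distributions (two fair dice and a biased two-valued variable) are assumptions chosen for the example. The joint probability of each outcome is the product of the marginal probabilities, which is precisely the independence hypothesis.

```python
from fractions import Fraction
from itertools import product


def min_max_probabilities(dists, x):
    """dists: list of dicts {value: probability}, one per independent variable.
    Returns (Pr(min >= x) by enumeration, product of Pr(xi_i >= x),
             Pr(max <  x) by enumeration, product of Pr(xi_i <  x))."""
    p_min_direct = Fraction(0)
    p_max_direct = Fraction(0)
    for outcome in product(*(d.items() for d in dists)):
        values = [v for v, _ in outcome]
        prob = Fraction(1)
        for _, p in outcome:
            prob *= p  # independence: joint probability is the product
        if min(values) >= x:
            p_min_direct += prob
        if max(values) < x:
            p_max_direct += prob
    p_min_product = Fraction(1)
    p_max_product = Fraction(1)
    for d in dists:
        p_min_product *= sum((p for v, p in d.items() if v >= x), Fraction(0))
        p_max_product *= sum((p for v, p in d.items() if v < x), Fraction(0))
    return p_min_direct, p_min_product, p_max_direct, p_max_product


if __name__ == "__main__":
    die = {k: Fraction(1, 6) for k in range(1, 7)}
    biased = {1: Fraction(1, 4), 4: Fraction(3, 4)}
    print(min_max_probabilities([die, die, biased], 3))
```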


The next example concerns a nonlinear transformation of independent random 
variables. 


Example 19 Let $X_1$, $X_2$ and $X_3$ be independent random variables such that

$$E[X_i] = 0 \quad \text{and} \quad E[X_i^2] = 1, \quad i = 1, 2, 3.$$

Calculate the mean and the variance of

$$Y = \frac{X_1 + X_2 X_3}{\sqrt{1 + X_3^2}}.$$

Solution The mean of Y is given by

$$E[Y] = E_{X_1, X_2, X_3}\left[\frac{X_1 + X_2 X_3}{\sqrt{1 + X_3^2}}\right] = E\left[\frac{X_1}{\sqrt{1 + X_3^2}}\right] + E\left[\frac{X_2 X_3}{\sqrt{1 + X_3^2}}\right].$$

Observe that $X_1$, $X_2$ and $X_3$ are independent, and $E[X_i] = 0$, $i = 1, 2, 3$. It then
follows that

$$E[Y] = E[X_1]\, E\left[\frac{1}{\sqrt{1 + X_3^2}}\right] + E[X_2]\, E\left[\frac{X_3}{\sqrt{1 + X_3^2}}\right] = 0.$$

Let us now calculate the variance:

$$\sigma_Y^2 = \mathrm{Var}(Y) = E_{X_1, X_2, X_3}\left[\frac{(X_1 + X_2 X_3)^2}{1 + X_3^2}\right]$$

$$= E\left[\frac{X_1^2}{1 + X_3^2}\right] + 2E\left[\frac{X_1 X_2 X_3}{1 + X_3^2}\right] + E\left[\frac{X_2^2 X_3^2}{1 + X_3^2}\right]$$

$$= E\left[\frac{1}{1 + X_3^2} + \frac{X_3^2}{1 + X_3^2}\right] = 1.$$


Before ending this subsection, let us mention that in the framework of Kalman
filtering, state estimation, model-based fault detection, etc., the innovation or the
measurement residual that reflects the discrepancy between the predicted output
and the actual measurement consists of a sequence of random variables which are
orthogonal to each other and uncorrelated (white noise) [3, 31].


1.2.8. Convergence


Let us consider again the game that consists of tossing a coin. The probability that
the outcome will be tails is equal to 1/2. It means that if we toss the coin n times
(for large n), we get tails approximately n/2 times. In other words, the relative
frequency associated with the event “tails” tends to the probability 1/2. Notice that
this notion of convergence differs from the usual (deterministic) one. Indeed, even
for large n, the relative frequency of an event can differ from its probability.


In the deterministic case we say that $x_n$ tends to $x^*$ when we mean that the difference
in norm between $x_n$ and $x^*$ becomes smaller and smaller as n increases. Consider
the probabilistic framework, such as obtaining a five when throwing a die. We know
that the probability of getting a five is equal to 1/6. If we consider the ratio between the
number of appearances of five and the total number of throws, this means that the
ratio will tend to 1/6 when the number of throws is very large (as n tends to infinity).
However, at some particular instant n (e.g., n = 10000) it is completely possible
that in the next m throws the number 5 does not appear at all (with m equal to 10, 20,
or 134, for example). Hence, we see that we cannot define the convergence of random
events as we do with deterministic phenomena. Owing to this observation, it is
necessary to introduce other notions related to the convergence in the probability
framework.


Let $\{x_n\}$ be a sequence of random variables with distribution functions $\{F_n\}$.
We say that:


Definition 15 (Convergence in distribution) $\{x_n\}$ converges in distribution
(law^11) to a random variable x with distribution function F(x) if the sequence
$\{F_n(x)\}$ converges to F(x) at every point x where F is continuous. This is written as

$$x_n \xrightarrow{\text{law}} x.$$


Definition 16 (Convergence in probability) $\{x_n\}$ converges in probability to a
random variable x if for any $\varepsilon, \delta > 0$ there exists $n_0(\varepsilon, \delta)$ such that

$$\forall n > n_0: \quad \Pr(|x_n - x| > \varepsilon) < \delta.$$

This is written as

$$x_n \xrightarrow{\text{prob}} x.$$


11 As pointed out by Doob, “This terminology is rather unfortunate, since the sequence of
random variables may not then converge in any ordinary sense.”


Example 20 Consider a sequence $\{X_n\}$ of independent random variables
distributed according to the Cauchy distribution $f_X(x) = (1/\pi)(1/(1 + x^2))$, i.e.,

$$\Pr(a < X_n < b) = \frac{1}{\pi} \int_a^b \frac{dx}{1 + x^2} \quad \text{for all } a < b. \qquad (1.21)$$

Let us introduce the following random variable $Z_n$ defined by

$$Z_n = \beta X_n, \qquad \beta = \frac{1}{n^{\alpha}},$$

where $\alpha$ is some fixed positive constant. The probability density function (see
Chapter 2) of $Z_n$ is given by

$$\Pr(Z_n < z) = \Pr(\beta X_n < z) = \Pr\left(X_n < \frac{z}{\beta}\right) = \int_{-\infty}^{z/\beta} f_X(x)\, dx, \qquad \frac{d \Pr(Z_n < z)}{dz} = \frac{1}{\beta} f_X\left(\frac{z}{\beta}\right),$$

where $f_X(x)$ denotes the probability density function (pdf) of the random
variable $X_n$. For $n = 1, \ldots, 6$, the pdfs of the random variable $Z_n$ are depicted in
Figure 1.15. Observe that the distribution of the random variable $Z_n$ resembles a
hairpin and concentrates on zero as n increases from 1 to 6.


Figure 1.15 Evolution of the probability distribution function as n increases


We shall now prove analytically that the sequence $Z_n$ converges in probability to
zero as $n \to \infty$. For any $\varepsilon > 0$, we get

$$\Pr(|Z_n| > \varepsilon) = \Pr(|X_n| > \varepsilon n^{\alpha}) = \Pr(X_n > \varepsilon n^{\alpha}) + \Pr(X_n < -\varepsilon n^{\alpha})$$

$$= \Pr(\varepsilon n^{\alpha} < X_n < +\infty) + \Pr(-\infty < X_n < -\varepsilon n^{\alpha}).$$

In view of (1.21) and using the equality $(1/\pi) \int_{-\infty}^{+\infty} dx/(1 + x^2) = 1$, we derive

$$\Pr(|Z_n| > \varepsilon) = \frac{1}{\pi} \int_{\varepsilon n^{\alpha}}^{+\infty} \frac{dx}{1 + x^2} + \frac{1}{\pi} \int_{-\infty}^{-\varepsilon n^{\alpha}} \frac{dx}{1 + x^2}$$

$$= 1 - \frac{1}{\pi} \int_{-\varepsilon n^{\alpha}}^{\varepsilon n^{\alpha}} \frac{dx}{1 + x^2} = 1 - \frac{2}{\pi} \arctan(\varepsilon n^{\alpha}) \to 0 \quad \text{as } n \to \infty.$$

Definition 17 (Convergence almost surely) $\{x_n\}$ converges almost surely (with
probability 1) to a random variable x if for any $\varepsilon, \delta > 0$ there exists $n_0(\varepsilon, \delta)$
such that

$$\Pr(|x_n - x| < \varepsilon \ \text{for all } n > n_0) > 1 - \delta,$$

or, in another form,

$$\Pr\left(\lim_{n \to \infty} |x_n - x| = 0\right) = 1.$$

This is written as

$$x_n \xrightarrow{\text{a.s.}} x.$$


Remark 14 The concepts of convergence in probability and almost sure
convergence give only information on the asymptotic behavior of the considered
random variable. They give no information on the behavior on a finite horizon.


Definition 18 (Convergence in quadratic mean) $\{x_n\}$ converges in quadratic
mean^12 (in the mean squares sense) to a random variable x if

$$\lim_{n \to \infty} E\{(x_n - x)^T (x_n - x)\} = 0.$$

This is written as

$$x_n \xrightarrow{\text{q.m.}} x.$$


12 Similarly, the convergence in mean of order p is defined as follows:
$E\{|x_n - x|^p\} \to 0$.


Example 21 Let x and y be nonnegative random variables and $p > 0$, then

$$E\{(x + y)^p\} \le 2^p \left[E\{x^p\} + E\{y^p\}\right].$$

This result shows that if x and y converge in mean of order p, then (x + y) converges
also in mean of order p. This result follows directly by making use of the following
inequality (see Lemma 3, Chapter 4, p. 95 in [12])

$$(a + b)^p \le [2 \max(a, b)]^p \le 2^p [a^p + b^p], \quad \text{for } a \text{ and } b \ge 0.$$


The relationships between these convergence concepts are summarized in the 
following [16]: 


1. convergence in probability implies convergence in law;
2. convergence in quadratic mean implies convergence in probability;
3. almost sure convergence implies convergence in probability.


In general, the converses of these statements are false.


Example 22 Consider a sequence of random variables $\{X_n\}_{n \ge 1}$ uniformly
distributed^13 on the segment [0, 1/n]. Derive the asymptotic properties of $X_n$.


Solution It is clear that $X_n$ converges to X = 0. The probability density function
of $X_n$ is

$$f_n(x) = \begin{cases} n & \text{for } 0 \le x \le 1/n \\ 0 & \text{otherwise.} \end{cases}$$


13 One of the most important applications of the uniform distribution is in the generation
of random variables. Uniformly distributed random variables are commonly used as
sources for random number generators (Rand, etc.) in numerical simulations. They are also
transformed into realizations of other random variables for the same purposes.

The probability density of a random variable uniformly distributed on [a, b] is given by

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a,\ x > b. \end{cases}$$

Notice also that the entropy, which is defined by

$$H = -E\{\log f(x)\},$$

assumes its maximum value, among the distributions concentrated on [a, b], for uniformly
distributed random variables.


Let us now calculate the mean of the sequence of random variables $\{X_n\}_{n \ge 1}$:

$$m_n = E[X_n] = \int_0^{1/n} x f_n(x)\, dx = n \int_0^{1/n} x\, dx = n \left[\frac{x^2}{2}\right]_{x=0}^{x=1/n} = \frac{1}{2n}.$$

The moment of order $s \ge 1$ is given by

$$E[|X_n - m_n|^s] = \int_0^{1/n} \left|x - \frac{1}{2n}\right|^s n\, dx$$

$$= \int_0^{1/2n} \left|x - \frac{1}{2n}\right|^s n\, dx + \int_{1/2n}^{1/n} \left|x - \frac{1}{2n}\right|^s n\, dx$$

$$= \int_0^{1/2n} \left(\frac{1}{2n} - x\right)^s n\, dx + \int_{1/2n}^{1/n} \left(x - \frac{1}{2n}\right)^s n\, dx$$

$$= -n \int_{1/2n}^{0} u^s\, du + n \int_0^{1/2n} u^s\, du \quad \text{where } u = \frac{1}{2n} - x \text{ and } u = x - \frac{1}{2n}, \text{ respectively,}$$

$$= 2n \int_0^{1/2n} u^s\, du = 2n\, \frac{(1/2n)^{s+1}}{s + 1} = \frac{1}{(s + 1)\, 2^s n^s} \to 0 \quad \text{as } n \to \infty.$$

This result can be expressed as follows: $X_n$ converges to 0 in the sth mean $\forall s \ge 1$.
Consequently, $X_n$ converges in probability and in distribution. The almost sure
convergence will be considered in Chapter 4, Section 4.3.
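For a random variable uniformly distributed on [0, 1/n], the central absolute moment of order s has the closed form $1/((s + 1)\, 2^s n^s)$. The sketch below compares this expression with a midpoint-rule evaluation of the integral $\int_0^{1/n} |x - 1/(2n)|^s\, n\, dx$. The function names and the number of integration steps are assumptions made for this illustration.

```python
def central_abs_moment_closed(n, s):
    """Closed form 1 / ((s + 1) * 2**s * n**s) for the uniform law on [0, 1/n]."""
    return 1.0 / ((s + 1) * (2 * n) ** s)


def central_abs_moment_numeric(n, s, steps=100_000):
    """Midpoint rule for the integral of |x - 1/(2n)|**s * n over [0, 1/n]."""
    h = (1.0 / n) / steps
    m = 1.0 / (2 * n)  # mean of the uniform law on [0, 1/n]
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += abs(x - m) ** s
    return total * h * n


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, central_abs_moment_closed(n, 2), central_abs_moment_numeric(n, 2))
```

For s = 2 the closed form reduces to $1/(12 n^2)$, the familiar variance of a uniform distribution on an interval of length 1/n.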


In what follows, we shall be concerned with finite Markov chains and martingales.


1.3. Finite Markov Chains


Process design is mainly based on modeling and simulation approaches. System
simulation is the mimicking of the operation of real systems by a computer [1, 31].
We strive to build models not just for the fun of it, but to use them for analysis,
the outcomes of which affect our future decisions (strategies).


The modeling problem can be tackled in many ways, but here we shall be 
concerned with the use of finite Markov chains in order to achieve this object- 
ive. Uncontrolled and controlled finite Markov chains have been under study for a 
long time, and from several points of view, in physics, mathematics, engineering, 
economics, and many other fields [8, 14, 19, 32, 37, 43, 51]. But the subject is such 
a fundamental and deep one that there is no doubt that Markov chains will continue 
to be an object of study for as long as one can foresee. In [40], based on Markov 
chains, a basic model for performance evaluation of a simplified radio network 
controller system has been developed. Notice that, under certain conditions, many 
stochastic algorithms such as iterated random functions, stochastic approximation 
techniques, simulated annealing, genetic algorithms, Kalman-Bucy filter, etc., can 
be mathematically modeled using Markov chains [6]. 


This section is devoted to uncontrolled finite Markov chains. We are going to 
consider the main results, relevant in many areas. The next subsection introduces 
mathematical concepts that are fundamental to the remainder of the text. We begin 
with a general presentation of the notion of state, and then specialize this treatment 
for Markov chains. 


Notion of State 

In this subsection our main emphasis will be on the introduction of the notion of 
state. Let us consider a room lit by an electrical system composed of an electrical 
power system and three light bulbs. This system is depicted in Figure 1.16. 


Figure 1.16 Lighting system


The behavior of this system and of each bulb $L_i$, $i = 1, 2, 3$, can be described by
the following states:

$$x(i) = 1 \text{ or } 0, \quad i = 1, 2, 3, 4.$$

The first three states characterize the behavior of the three light bulbs and the fourth
one represents the functioning of the lighting system. If a light bulb is in operation
(out of order), its state takes the value 1 (0). The state x(4), which is associated with
the system shown in Figure 1.16, takes values in the set {0, 1}, where one (zero)
corresponds to the presence (absence) of light.


Hitherto, we have described an electrical circuit which is commonly used in the 
basic lectures on Boolean algebra. In more general terms, physical systems are 
designed and built to perform certain defined functions (separation, drying, chem- 
ical or biochemical reactions, etc.). In order to determine whether a system is 
performing properly, the engineer must know what the system is “doing” at any 
instant of time. In other words, the engineer must know the state of the system. 
In navigation, the state consists of the position and velocity of the craft in question;
in a distillation column, the state may be taken as concentration, heating power
and reflux.


Notice also that the state-space representation has long been used to deal with
multivariable systems. In practice, the measurements (observations) are corrupted
by noise. The problem of estimating the states of stochastic dynamical systems based
on noisy observations of the states is of central importance in many engineering
applications (signal processing, chemical processes, mineral industry, etc.). The main
tool used for handling this estimation problem is the Kalman-Bucy filter^14.


We are now ready to define finite Markov chains, which play a fundamental role
in many diverse areas of application of stochastic processes. Let us first recall
the Markov property: the present state of the system completely determines the
probabilities for the next step into the future. In other words, we have a fundamental
one-step dependence.


Definition 19 (Markov chain) A finite stationary Markov chain consists of a set
$X = \{x(1), \ldots, x(N)\}$ of states and a probability transition matrix

$$\Pi = [p_{ij}], \quad i, j = 1, \ldots, N,$$

$$p_{ij} \in [0, 1], \qquad \sum_{j=1}^{N} p_{ij} = 1 \quad (i = 1, \ldots, N),$$

where $p_{ij}$ represents the transition probability from the state x(i) to the state x(j), i.e.,

$$p_{ij} = \Pr[x_{n+1} = x(j) \mid x_n = x(i)].$$

Here $x_n$ represents the state occupied at time n (n = 0, 1, ...).


14 Interest in this problem dates back almost two centuries to the work of Gauss. Gauss was
interested in determining the orbital elements of a celestial body from (many) observations
and developed the technique that is known today as the least squares method.


Notice that the ith row (i = 1, ..., N) of $\Pi$ represents the transition probabilities
from the state x(i) to the states x(j) (j = 1, ..., N).
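The two requirements placed on a probability transition matrix (entries in [0, 1], each row summing to 1) and the role of its rows can be illustrated with a few lines of Python. This is a minimal sketch: the helper names and the two-state matrix are hypothetical and are not taken from the text.

```python
def is_stochastic(P, tol=1e-9):
    """True if every entry lies in [0, 1] and every row sums to 1."""
    return all(
        all(0.0 <= p <= 1.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )


def step(dist, P):
    """One transition: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.4, 0.6]]          # hypothetical two-state chain
    print(is_stochastic(P))   # row sums are checked here
    d = [1.0, 0.0]            # the chain starts in the first state
    d = step(d, P)
    print(d)                  # first row of P
    print(step(d, P))         # distribution after two transitions
```

Starting from a distribution concentrated on one state, a single step returns the corresponding row of the matrix, which is the interpretation of the rows given above.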


Remark 15 For finite controlled Markov chains, the probability transition matrix
depends on the actions and is denoted by

$$\Pi^k = [p_{ij}^k], \quad i, j = 1, \ldots, N, \quad k = 1, \ldots, K,$$

$$p_{ij}^k = \Pr[x_{n+1} = x(j) \mid x_n = x(i),\ u_n = u(k)],$$

where the index k corresponds to the action $u_n = u(k) \in U = \{u(1), \ldots, u(K)\}$ selected
at time n.


The probability transition matrix is a stochastic matrix [48]. The transition prob- 
abilities and the initial distribution determine the distribution of the process. 
This result is stated in the following proposition. 
Proposition 1 (Transition probability) For a stationary Markov chain with
transition probability matrix $\Pi$ and initial distribution $\Pr[x_0 = x(i_0)] = p_0(i_0)$,
let us consider the following probability:

$$\Pr[x_0 = x(i_0), x_1 = x(i_1), \ldots, x_n = x(i_n)].$$

In view of the definition of the conditional probability, we derive

$$\Pr[x_0 = x(i_0), x_1 = x(i_1), \ldots, x_n = x(i_n)]$$
$$= \Pr[x_n = x(i_n) \mid x_0 = x(i_0), \ldots, x_{n-1} = x(i_{n-1})]$$
$$\times \Pr[x_0 = x(i_0), \ldots, x_{n-1} = x(i_{n-1})]. \qquad (1.22)$$

Making use of the Markov property yields

$$\Pr[x_n = x(i_n) \mid x_0 = x(i_0), \ldots, x_{n-1} = x(i_{n-1})] = \Pr[x_n = x(i_n) \mid x_{n-1} = x(i_{n-1})] = p_{i_{n-1} i_n}.$$

By induction, we get

$$\Pr[x_0 = x(i_0), x_1 = x(i_1), \ldots, x_n = x(i_n)] = p_{i_{n-1} i_n}\, p_{i_{n-2} i_{n-1}} \cdots p_{i_0 i_1}\, p_0(i_0).$$
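The product formula obtained by induction translates directly into code: the probability of a trajectory is the initial probability of its first state multiplied by the one-step transition probabilities along the path. The sketch below assumes a hypothetical two-state chain and uses zero-based state indices; both choices are made only for illustration.

```python
from itertools import product


def path_probability(p0, P, path):
    """Pr[x_0 = path[0], ..., x_n = path[n]] for initial distribution p0
    and transition matrix P (states are indexed from 0)."""
    prob = p0[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= P[i][j]  # Markov property: only the current state matters
    return prob


if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.4, 0.6]]      # hypothetical transition matrix
    p0 = [0.5, 0.5]       # hypothetical initial distribution
    print(path_probability(p0, P, [0, 0, 1, 1]))
    # The probabilities of all trajectories of a given length sum to 1.
    print(sum(path_probability(p0, P, list(w)) for w in product(range(2), repeat=4)))
```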
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Let us consider the following example: 


Example 23 (Three-state Markov chain) Let us consider the following transition
probability matrix:

$$\Pi = \begin{bmatrix} 0.5 & 0.2 & 0.3 \\ 0.15 & 0.25 & 0.6 \\ 0 & 0 & 1 \end{bmatrix}.$$

The schematic diagram (graph) of this three-state Markov chain is depicted in
Figure 1.17, where the circles and the arcs represent the states and their connections
(transitions), respectively. From this example, we observe that the states do not
exhibit the same properties. As an example, if at any time the process is in state
x(3), it remains there forever.


Example 24 (Polymerization) The thermal polymerization process can be modelled
as a discrete set of states in which the reactive end of a growing polymer
consists of n bonded monomers. The transition probability of incorporating the
(n + 1)th free amino acid into the polymer is influenced only by the interaction between
the incoming amino acid and the nth amino acid at the reactive end of the
polymer, and is not influenced by any of the previous monomers n − 1, n − 2, ..., already
bonded in the n-polymer. This process has been modelled as a Markov chain by
Mosqueira et al. [45].


Figure 1.17 Schematic diagram of a three-state Markov chain




Example 25 (Reliability) In system reliability theory, the system states and
the possible transitions are illustrated by a state-space diagram, which is also
known as a Markov diagram. An example of such a diagram is given in Figure 1.18.
The various system states are defined by the states of the components comprising
the system [4, 5, 30].


Consider the parallel structure of two components in Figure 1.19. The structure is
functioning when at least one of the components is functioning. Since each of the
components has two possible states, the parallel structure has four possible states.
These states are listed in Table 1.1 [4, 5, 30].


Assume that the following corrective maintenance strategy is adopted: when a com- 
ponent fails, a repair action is initiated to bring this component back to its initial 
functioning state. After the repair is completed, the component is assumed to be as 
good as new. The possible transitions between the four system states are illustrated 
in Figure 1.18. 


The state-space method is not restricted to only two possible states of the compon- 
ents. The method can be used to model rather complicated repair and switching 
strategies. Common cause failures may also be modeled by the state-space method. 


Figure 1.18 State-space diagram of a parallel structure of two components


Figure 1.19 Parallel structure of two components


Table 1.1 Possible states of a system of two components

System state   State of component 1   State of component 2   Comment
3              1                      1                      Both components functioning
2              0                      1                      Component 2 functioning, component 1 in failed state
1              1                      0                      Component 1 functioning, component 2 in failed state
0              0                      0                      Both components in failed state


The state variable of the system at time t is denoted by X(t). The system is assumed
to start in a specified state, say state i, at time t = 0. The transition between states
may be described by a stochastic process {X(t); t ≥ 0}. Many systems have
transitions that can be approximately described by a stochastic process with the Markov
property: Given that a system is in state i at time t (i.e., X(t) = i), the future states
do not depend on the previous states X(s), s < t. In other words, when its present
state is known, the probability of any particular future behavior of the process is
not altered by additional knowledge about its past behavior.


In the next subsubsection we shall be concerned with the characterization 
(classification) of states. 


Characterization of the States 

We see in the light of what was said above (see Example 23), that the states 
of a Markov chain can exhibit different properties. This section deals with the 
characterization (classification) of the states. 


Definition 20 (Accessible or reachable states) The state x(i) is said to be
accessible (reachable) from state x(j) if there is a positive probability that in a finite
number of steps state x(i) can be reached starting from state x(j).


Definition 21 (Communicating states) If the state x(i) is accessible from the
state x(j), and vice versa, they are said to communicate.


Communicating states are denoted by

$$x(i) \leftrightarrow x(j).$$


An equivalence relation allows us to regard two equivalent objects as being the
same for some particular purpose. A binary relation is an equivalence
relation if it is reflexive, symmetric, and transitive. The notion of communication
is an equivalence relation, i.e.,


1. Reflexivity:
   $x(i) \leftrightarrow x(i)$;
2. Symmetry:
   if $x(i) \leftrightarrow x(j)$, then $x(j) \leftrightarrow x(i)$;
3. Transitivity:
   if $x(i) \leftrightarrow x(j)$ and $x(j) \leftrightarrow x(k)$, then $x(i) \leftrightarrow x(k)$.
Definition 22 (Recurrent and transient states) A state x(i) is said to be
recurrent if and only if starting at x(i) there will be a return to x(i) with probability 1, i.e.,

$$\Pr[x_n = x(i) \text{ i.o.}^{15}] = 1;$$

otherwise the state is said to be transient, i.e.,

$$\Pr[x_n = x(i) \text{ i.o.}] = 0.$$

If x(i) is a recurrent state and $n_i$ is the average number of steps (time) required to
return to x(i) when the initial state is x(i), then x(i) is said to be recurrent null if
$n_i = \infty$, and recurrent positive if $n_i$ is finite ($n_i < \infty$).


Definition 23 (Absorbing states) A state x(i) is said to be absorbing if $p_{ii} = 1$
(when the chain enters this state, it stays there forever).


It is clear from this definition that if a Markov chain contains s absorbing states,
then it can be partitioned as follows (possibly after renumbering the states):

$$\Pi = \begin{bmatrix} I_{s \times s} & P \\ R & Q \end{bmatrix},$$

where $I_{s \times s}$ is an identity matrix (the absorbing states are reordered from 1 to s).


Owing to the fact that

$$\sum_{j=1}^{N} p_{ij} = 1,$$

the components of the matrix P are equal to zero, i.e.,

$$P_{s \times (N-s)} = 0,$$

and the transition matrix has the following form

$$\Pi = \begin{bmatrix} I_{s \times s} & 0 \\ R & Q \end{bmatrix}.$$


15 i.o. means infinitely often.
The components of the matrix Q represent the transition probabilities among the
transient states of the Markov chain considered. For an absorbing Markov chain,
the matrix $(I - Q)^{-1}$ is called the fundamental matrix [37]. The inverse of
$(I - Q)$ exists and is given by

$$(I - Q)^{-1} = I + Q + Q^2 + \cdots = \sum_{k=0}^{\infty} Q^k.$$

The mean of the total number of times the considered Markov chain is in a given
transient state (the expected number of visits to a given transient state j when the
chain starts in the transient state i) is given by the fundamental matrix (see for
instance Section 3.2.4, p. 46 in [37]).
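The series $I + Q + Q^2 + \cdots$ gives a direct way of approximating $(I - Q)^{-1}$ when Q collects the transition probabilities among transient states. The sketch below sums a truncated series and checks the result against the identity $(I - Q)(I - Q)^{-1} = I$. The 2 × 2 matrix Q, the number of terms and the helper names are assumptions made for this illustration.

```python
def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def fundamental_matrix(Q, terms=200):
    """Truncated Neumann series I + Q + Q^2 + ... + Q^terms."""
    n = len(Q)
    total = identity(n)
    power = identity(n)
    for _ in range(terms):
        power = mat_mul(power, Q)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


if __name__ == "__main__":
    Q = [[0.5, 0.2],
         [0.15, 0.25]]    # hypothetical transient-to-transient block
    N = fundamental_matrix(Q)
    print(N)              # entry (i, j): expected number of visits to j from i
    I_minus_Q = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(2)] for i in range(2)]
    print(mat_mul(I_minus_Q, N))   # close to the identity matrix
```

The truncation is harmless here because the powers of Q decay geometrically; for a large chain a linear solver would be the more usual choice.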


Definition 24 (irreducible Markov chains) An irreducible Markov chain is one 
in which all states intercommunicate. 


The states are divided into equivalence classes. Two states belong to the same equi- 
valence class if they communicate. The transient and non-transient states constitute 
the transient and ergodic sets respectively. The states of an irreducible Markov chain 
constitute a single closed set and all its states are of the same type. 


Definition 25 (Closed subset) A closed subset A of the states of a Markov chain
is a set where all the states are communicating, and no state x(i) belonging to A
communicates with a state $x(k) \notin A$.


Remark 16 If a closed set is constituted only by one state, then this state is an
absorbing state.


Remark 17 A closed communicating class of states constitutes a sub-Markov 
chain (a Markov chain), which can be studied separately. 


This definition of closed sets is different from the notion of a closed set in standard
mathematical analysis. Indeed, a subset of the real line is closed if it contains all
of its limit points (a closed segment, for example).


We shall give a mathematical characterization of the states of Markov chains. Let us
assume that the chain is in state x(i) at time 0 (initial time), and consider the following
probability:

$$tr_n(i, i) = \Pr[x_k \ne x(i),\ k = 1, 2, \ldots, n - 1;\ x_n = x(i) \mid x_0 = x(i)], \qquad (1.23)$$

which represents the probability that the chain will return to the initial state for the
first time at the nth transition. The probability $tr_n(i, i)$ is said to be the first return
probability at time n.


The state x(i) is said to be recurrent (the return to this state is a certain event) if

$$\sum_{n=1}^{\infty} tr_n(i, i) = 1$$

and transient (the return to the considered state has a probability less than 1) if

$$\sum_{n=1}^{\infty} tr_n(i, i) < 1.$$

In other words, a state is recurrent if and only if it is revisited infinitely often
with probability 1, and transient if and only if it is revisited finitely often with
probability 1 [14].


When the state x(j) is recurrent, the mean of the probability distribution
$\{tr_n(j, j),\ n = 1, 2, \ldots\}$ is given by

$$\mu_j = \sum_{n=1}^{\infty} n\, tr_n(j, j).$$

The recurrent state is said to be positive if

$$\mu_j < \infty \quad (\text{i.e., } \mu_j^{-1} > 0)$$

and null if

$$\mu_j = \infty \quad (\text{i.e., } \mu_j^{-1} = 0).$$


Now we shall deal with the first passage probability from state x(i) to state x(k)
at time n:

$$tr_n(i, k) = \Pr[x_m \ne x(k),\ m = 1, 2, \ldots, n - 1;\ x_n = x(k) \mid x_0 = x(i)]. \qquad (1.24)$$

The mean first passage time from state x(j) to state x(k) is given by

$$\tau_{jk} = \sum_{n=1}^{\infty} n\, tr_n(j, k).$$


The Markov chain is said to be periodic if the subsequent occupations of state x(j)
occur at times $\eta, 2\eta, 3\eta, \ldots$, where $\eta$ is an integer greater than 1. The period of the
Markov chain is equal to the largest such value of $\eta$, i.e., the greatest common divisor
of the numbers n for which $tr_n(j, j) > 0$. Otherwise, the Markov chain is said to
be aperiodic.


We have presented a classification of states on the basis of the asymptotic properties
of the transition probabilities $tr_n(j, j)$. We can expect that this classification may be
reflected by the transition matrix itself, i.e., the properties of transition probabilities
$tr_n(j, k)$. Indeed, the classification of states can be carried out from the investigation
of the transition matrix [14, 37, 51, 60].


Let us consider a Markov chain with the following transition matrix, which consists
of a juxtaposition of matrices (partitioned matrix):

$$\Pi = \begin{bmatrix} 0.2 & 0.3 & 0.5 & 0 & 0 \\ 0.1 & 0.5 & 0.4 & 0 & 0 \\ 0.7 & 0.125 & 0.175 & 0 & 0 \\ 0 & 0 & 0 & 0.65 & 0.35 \\ 0 & 0 & 0 & 0.25 & 0.75 \end{bmatrix} = \begin{bmatrix} \Pi_1 & 0 \\ 0 & \Pi_2 \end{bmatrix}.$$

The graph (schematic diagram) associated with this Markov chain is depicted in
Figure 1.20.


Figure 1.20 Graph of Markov chain with two indecomposable classes


This figure shows that we are dealing with two specific sets of states, namely,
{x(1), x(2), x(3)} and {x(4), x(5)}. These sets correspond to the matrices $\Pi_1$ and
$\Pi_2$ respectively, and correspond to what are called indecomposable classes. The
analysis of their properties reduces to the analysis of two separate Markov chains
with transition matrices $\Pi_1$ and $\Pi_2$ respectively. The eigenvalues^16 of this
probability transition matrix are: $\lambda_1 = -0.35645$, $\lambda_2 = 1.0$, $\lambda_3 = 0.231$, $\lambda_4 = 0.4$ and
$\lambda_5 = 1.0$.
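The decomposition of the five-state chain into two classes can also be recovered mechanically from the zero pattern of the transition matrix: two states communicate when each is reachable from the other. The sketch below computes the reachability relation by transitive closure and groups the states accordingly. The function names and the zero-based indexing (state x(1) is index 0) are choices made for this illustration, and the matrix entries are those read from the partitioned matrix above.

```python
def reachable(P):
    """reach[i][j] is True if state j can be reached from state i in a
    finite number of steps (zero steps allowed, which gives reflexivity)."""
    n = len(P)
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):                      # Warshall transitive closure
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach


def communicating_classes(P):
    """Equivalence classes of the relation 'i and j communicate'."""
    reach = reachable(P)
    n = len(P)
    seen, classes = set(), []
    for i in range(n):
        if i in seen:
            continue
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        seen.update(cls)
        classes.append(cls)
    return classes


if __name__ == "__main__":
    P = [[0.2, 0.3, 0.5, 0.0, 0.0],
         [0.1, 0.5, 0.4, 0.0, 0.0],
         [0.7, 0.125, 0.175, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.65, 0.35],
         [0.0, 0.0, 0.0, 0.25, 0.75]]
    print(communicating_classes(P))
```

Only the positions of the nonzero entries matter for this classification; the numerical values of the probabilities play no role.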


As mentioned before, the characteristics of a Markov chain depend on the properties
of the transition matrix, and more precisely on the eigenvalues of the transition
matrix. In what follows, we shall relate the Markov chain properties (properties of
the states) to the eigenvalues of the transition matrix. Indeed, the eigenvalues can be
considered as the “X-ray photography” of a given matrix. Using two array elements
and two independent noise sources located in the far field, Testa and Vannicola
[64] have described the physical significance of the eigenvalues associated with an
adaptive antenna array. They show that the eigenvalues of the covariance matrix
of the array represent linear combinations of the power of the independent noise
sources as the antenna array output channels see them.


In what follows, we shall present some useful theorems for the investigation of 
Markov chains. 


Theorem 1 (Multiplicity of roots) The multiplicity of the root $\lambda = 1$ of the
characteristic equation of the transition matrix is equal to the number of irreducible
subsets of the Markov chain.


The proof of this theorem will not be given here. The interested reader is referred 
to [32, 35]. 


Let us investigate the behavior of the Markov chain characterized by the following
transition matrix:

$$\Pi = \begin{bmatrix} 0 & 0 & 0.55 & 0.45 \\ 0 & 0 & 0.2 & 0.8 \\ 0.9 & 0.1 & 0 & 0 \\ 0.35 & 0.65 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & \Pi_1 \\ \Pi_2 & 0 \end{bmatrix}.$$


16 The eigenvalues (latent or characteristic roots) of an N × N matrix A are the roots of the
characteristic equation, i.e.,

$$\det(A - \lambda I) = 0.$$

The eigenvectors are the solutions of the following equation:

$$Ax = \lambda x, \quad \dim x = N.$$


Figure 1.21 Markov chain with cyclic property


The graph (digraph) of this Markov chain is shown in Figure 1.21. This chain 
consists of two cyclic!” subclasses: {x(1), x(2)) and {x (3), x(4)}. 


To geta fuller idea of the behavior of this Markov chain, let us consider the following 
tree representations (Figures 1.22 and 1.23) in which the transition probabilities are 
assigned to the branches, and the states are represented by circles. These figures 
show clearly: 


1. that the return to each state is accomplished only after an even number of 
transitions; 
2. the cyclical characteristic of this Markov chain. Indeed, the period is equal to 2. 


The eigenvalues of this matrix are: $\lambda_1 = 1.0$, $\lambda_2 = -1.0$, $\lambda_3 = 0.43875$, 
$\lambda_4 = -0.43875$. We have two eigenvalues with absolute value equal to 1 
($|\lambda_1| = |\lambda_2| = 1$). Then the period of this transition matrix is equal to 2. 
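
The eigenvalue check can be reproduced numerically. The following is a minimal sketch, assuming NumPy is available; the matrix is entered by hand from the example, and the variable names (`P`, `period`) are illustrative only. Counting the eigenvalues of modulus 1 gives the period here because the chain is irreducible.

```python
import numpy as np

# Transition matrix of the cyclic chain of Figure 1.21 (each row sums to 1).
P = np.array([
    [0.00, 0.00, 0.55, 0.45],
    [0.00, 0.00, 0.20, 0.80],
    [0.90, 0.10, 0.00, 0.00],
    [0.35, 0.65, 0.00, 0.00],
])

# All eigenvalues of this matrix are real; sort them for readability.
eigenvalues = np.sort(np.linalg.eigvals(P).real)

# Number of eigenvalues on the unit circle (equals the period for an irreducible chain).
period = int(np.sum(np.isclose(np.abs(eigenvalues), 1.0)))

print(eigenvalues)
print(period)
```

The nonunit eigenvalues are the square roots of 0.1925, which is where the value 0.43875 comes from.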


Definition 26 (Ergodic Markov chain) A Markov chain is said to be ergodic if 
the limits 


$$\lim_{n\to\infty} \pi_{ij}^{(n)} = \pi_j$$


exist for all j independently of i, and 


17 If the system is in a state belonging to the set {x(1), x(2)}, it moves to one of the states 
{x(3), x(4)} in one step, and returns to a state in {x(1), x(2)} (the starting set) at the next 
step. These comments are also valid if the system is initially in one of the states {x(3), x(4)}. 

Figure 1.22 Tree representation with initial state x(1) 


In other words, an irreducible and aperiodic Markov chain is said to be ergodic 
if there exists a stationary (or invariant) distribution $\pi$ (the distribution over states 
remains constant for all subsequent time and does not depend on the initial conditions) 
such that 

$$\lim_{n\to\infty} \Pi^n = \mathbf{1}\pi, \qquad \sum_{i=1}^{N} \pi_i = 1, \tag{1.25}$$

where $\pi = [\pi_1, \ldots, \pi_N]$ is a row vector and $\mathbf{1}$ denotes the column vector of ones. 
If we have an initial distribution equal to $\pi$, we easily get 

$$\pi \Pi = \pi. \tag{1.26}$$

This relation can be interpreted in the framework of matrix calculations. Indeed, $\pi$ 
corresponds to the left eigenvector¹⁸ of the transition matrix which is associated 
with the eigenvalue 1. 


18 The left eigenvectors of an N × N matrix A are given by the nontrivial solutions of 
$$x A = \lambda x,$$
where x represents a left eigenvector of A (a row vector) and $\lambda$ the associated eigenvalue. 

Figure 1.23 Tree representations for initial states x(2), x(3) and x(4), respectively 


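Example 26 (Stationary distribution) Let us consider the stationary distribution 
$\pi$ associated with the following transition matrix: 

$$
\Pi = \begin{bmatrix}
0.6 & 0.4 & 0 \\
0.4 & 0 & 0.6 \\
0 & 0.25 & 0.75
\end{bmatrix}.
$$

The stationary distribution $\pi$ is given by the solution of the following equations: 

$$
\begin{aligned}
0.6\pi_1 + 0.4\pi_2 &= \pi_1 \;\Rightarrow\; \pi_1 = \pi_2, \\
0.4\pi_1 + 0.25\pi_3 &= \pi_2 \;\Rightarrow\; \pi_3 = 2.4\pi_1, \\
0.6\pi_2 + 0.75\pi_3 &= \pi_3 \;\Rightarrow\; \pi_3 = 2.4\pi_2, \\
\pi_1 + \pi_2 + \pi_3 &= 1,
\end{aligned}
$$

which lead to 

$$\pi_1 = \pi_2 = \tfrac{5}{22} \approx 0.2273, \qquad \pi_3 = \tfrac{12}{22} \approx 0.5455.$$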
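
The same stationary distribution can be obtained numerically. This is a minimal sketch, assuming NumPy; it stacks the balance equations $\pi\Pi = \pi$ with the normalization condition and solves the resulting consistent system by least squares. The names `A`, `b` and `pi` are illustrative, not notation from the text.

```python
import numpy as np

P = np.array([[0.6, 0.4, 0.0],
              [0.4, 0.0, 0.6],
              [0.0, 0.25, 0.75]])
N = P.shape[0]

# pi (P - I) = 0 written column-wise, plus the row sum(pi) = 1.
A = np.vstack([(P - np.eye(N)).T, np.ones(N)])
b = np.append(np.zeros(N), 1.0)

# The system is overdetermined but consistent, so least squares returns the exact solution.
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)
```

Solving the eigenproblem for the left eigenvector associated with the eigenvalue 1 and normalizing it would be an equivalent route.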


Now we are interested in the probability of moving from a given state i to another 
state j in n transitions. Observe that the Markov chain can move from one state to 
another in many ways. This probability is associated with the collections of paths 
that go from state i to state k in m transitions and then move from state k to state j 
in (n − m) transitions: 

$$\pi_{ij}^{(n)} = \sum_{k=1}^{N} \pi_{ik}^{(m)}\, \pi_{kj}^{(n-m)}.$$


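This equation is called the Chapman–Kolmogorov equation. It corresponds to the 
power of the transition matrix, i.e., $\Pi^n$. This matrix can be calculated on the basis 
of the eigenvalues (eigenvectors), namely the spectral decomposition, which can 
be formulated as follows: If a given N × N matrix A has N linearly independent 
eigenvectors $x_i$ ($A x_i = \lambda_i x_i$), then it can be decomposed in the following form 
(spectral decomposition) 

$$A = L \Lambda L^{-1}, \tag{1.27}$$

where $\Lambda$ is a diagonal matrix with elements $\lambda_i$: 

$$
\Lambda = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \lambda_N
\end{bmatrix}
$$

and L is the matrix whose columns are the eigenvectors: 

$$L = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}.$$

From (1.27), we easily derive 

$$A^n = L \Lambda^n L^{-1}$$

with 

$$
\Lambda^n = \begin{bmatrix}
\lambda_1^n & 0 & \cdots & 0 \\
0 & \lambda_2^n & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \lambda_N^n
\end{bmatrix}.
$$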


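Example 27 Let us consider again the probability transition matrix of 
Example 26. Calculate $\Pi^4$. 


Solution The eigenvalues and the eigenvectors of $\Pi$ are 

$$\lambda_1 = 0.66554, \qquad \lambda_2 = 1.0, \qquad \lambda_3 = -0.31554$$

and 

$$
\begin{bmatrix} -0.89016 \\ -0.14584 \\ 0.43167 \end{bmatrix}, \qquad
\begin{bmatrix} 0.63311 \\ 0.63311 \\ 0.63311 \end{bmatrix} \qquad \text{and} \qquad
\begin{bmatrix} 0.39461 \\ -0.9032 \\ 0.21191 \end{bmatrix},
$$

respectively. Taking these eigenvectors as the columns of L, the spectral decomposition 
of this matrix gives 

$$
L = \begin{bmatrix}
-0.89016 & 0.63311 & 0.39461 \\
-0.14584 & 0.63311 & -0.9032 \\
0.43167 & 0.63311 & 0.21191
\end{bmatrix}, \qquad
L^{-1} = \begin{bmatrix}
-0.70600 & -0.11567 & 0.82166 \\
0.35898 & 0.35898 & 0.86155 \\
0.36563 & -0.83687 & 0.47124
\end{bmatrix}
$$

and 

$$
\Pi^4 = L \begin{bmatrix}
(0.66554)^4 & 0 & 0 \\
0 & 1.0 & 0 \\
0 & 0 & (-0.31554)^4
\end{bmatrix} L^{-1}.
$$

Hence 

$$
\Pi^4 = \begin{bmatrix}
0.352 & 0.2442 & 0.4038 \\
0.2442 & 0.238075 & 0.517725 \\
0.16825 & 0.215719 & 0.616031
\end{bmatrix}.
$$

Each row of $\Pi^4$ sums to 1, as it must for a transition matrix. 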
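
The power of the transition matrix can be cross-checked numerically. The sketch below assumes NumPy and uses its convention that `numpy.linalg.eig` returns the right eigenvectors as the columns of a matrix (called `V` here, an illustrative name), so that $\Pi = V \Lambda V^{-1}$. The result is compared with direct repeated multiplication.

```python
import numpy as np

P = np.array([[0.6, 0.4, 0.0],
              [0.4, 0.0, 0.6],
              [0.0, 0.25, 0.75]])

# Right eigenvectors are the columns of V, so P = V diag(eigenvalues) V^{-1}.
eigenvalues, V = np.linalg.eig(P)

# Fourth power through the spectral decomposition: only the eigenvalues are raised to the power.
P4_spectral = (V @ np.diag(eigenvalues ** 4) @ np.linalg.inv(V)).real

# Direct computation for comparison.
P4_direct = np.linalg.matrix_power(P, 4)

print(np.round(P4_spectral, 6))
```

For a single small power the direct product is simpler; the spectral route pays off when many powers, or the limit as n grows, are needed.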


When the eigenvectors are not linearly independent, the Jordan canonical form 
decomposition can be used to simplify the calculation of the powers of the transition 
matrix [26, 29]. 


To end this subsection, let us mention another use of Markov chains which concerns 
sampling [25]. Indeed, in many applications it is useful to sample from a finite set 
of objects in accordance with some distribution. One approach is to run an ergodic 
Markov chain whose stationary distribution is the desired distribution on this set. 
In other words, the distribution governing the states of the chain approximates the 
desired distribution [52]. 


1.4. Renewal Processes 


We shall now introduce some functions which are directly related to the probability 
distribution function F(t), namely the reliability function R(t), the hazard function 
$\rho(t)$, and the cumulative hazard function, which represent quantitative measures in 
different areas such as reliability, life data analysis, management, economics, etc. 
Recall that the reliability function, which was introduced in Section 1.2, is given by 

$$R(t) = 1 - F(t). \tag{1.28}$$

It is also known as the survival function, and represents the probability that a device 
does not fail in the time interval (0, t]. 


The hazard function $\rho(t)$ represents the conditional probability of failure, per unit 
time, during a very small time increment under the assumption that no failures have 
occurred prior to that time: 

$$\rho(t) = \frac{f(t)}{R(t)}. \tag{1.29}$$

The term 

$$\frac{f(t)\,dt}{R(t)}$$

represents the conditional probability of failure within the interval between t and t + dt, 
given that the system is in operating state at time t. This function, which is a 
transformation of the survival function R(t), is also called the failure rate, the conditional 
failure, the intensity, or the force of mortality function. Some examples related to 
the calculation of the hazard rate are given in Chapter 2. 


This function was initially introduced for modelling optimization problems 
concerning maintenance and replacement of unreliable machines [34]. In fact, even 
though probabilities are involved, this optimization problem becomes a deterministic 
one. The use of the hazard function has been extended to other modelling problems, 
for example those involved in management sciences and economics [59]. 


The cumulative hazard function is defined as follows: 


$$\rho_c(t) = \int_0^t \rho(\tau)\,d\tau. \tag{1.30}$$

From (1.29) and (1.30), it is easy to derive the following identities: 

$$F(t) = 1 - \exp\left(-\int_0^t \rho(\tau)\,d\tau\right),$$

$$\rho(t) = -\frac{d \ln R(t)}{dt},$$

$$\rho_c(t) = -\ln R(t).$$


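Example 28 Let us verify the first identity, $F(t) = 1 - \exp\left(-\int_0^t \rho(\tau)\,d\tau\right)$. 


Solution By definition, 

$$\rho(t) = \frac{f(t)}{1 - F(t)}.$$

Observe that 

$$\frac{f(t)\,dt}{1 - F(t)} = -\frac{d\left(1 - F(t)\right)}{1 - F(t)}.$$

Integrating over [0, t] and using F(0) = 0, we obtain 

$$\int_0^t \rho(\tau)\,d\tau = -\int_0^t \frac{d\left(1 - F(\tau)\right)}{1 - F(\tau)} = -\ln\left(1 - F(t)\right),$$

so that 

$$1 - F(t) = \exp\left(-\int_0^t \rho(\tau)\,d\tau\right), \qquad F(t) = 1 - \exp\left(-\int_0^t \rho(\tau)\,d\tau\right).$$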
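
The identities relating the hazard function, the cumulative hazard function and F(t) can be checked numerically. The sketch below assumes NumPy and uses a Weibull law purely as an illustration; the shape `beta` and scale `eta` are arbitrary assumed values, not taken from the text. The cumulative hazard is integrated with the trapezoidal rule.

```python
import numpy as np

beta, eta = 2.0, 1.5                      # illustrative Weibull shape and scale (assumed)
t = np.linspace(0.0, 4.0, 2001)

F = 1.0 - np.exp(-(t / eta) ** beta)      # distribution function F(t)
f = (beta / eta) * (t / eta) ** (beta - 1) * np.exp(-(t / eta) ** beta)  # density f(t)
rho = f / (1.0 - F)                       # hazard function, equation (1.29)

# Cumulative hazard (1.30): running trapezoidal integral of rho from 0 to t.
increments = 0.5 * (rho[1:] + rho[:-1]) * np.diff(t)
rho_c = np.concatenate(([0.0], np.cumsum(increments)))

# First identity: F(t) recovered from the hazard function alone.
F_from_hazard = 1.0 - np.exp(-rho_c)
max_error = np.max(np.abs(F_from_hazard - F))
print(max_error)
```

With `beta = 2` the hazard is linear in t, so the trapezoidal rule is exact up to rounding; other shapes would leave a small discretization error.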
The degradation law of a given system can be entirely defined by one of the functions 
defined above, i.e., f(t), $\rho(t)$, F(t) and R(t). Table 1.2 gives the relations between 
these functions (we denote $Y(t) := \exp\left(-\int_0^t \rho(x)\,dx\right)$). 


Table 1.2 Correspondence between f(t), ρ(t), F(t) and R(t) 

Given  | F(t)            | R(t)            | f(t)        | ρ(t) 
F(t)   | -               | 1 - F(t)        | dF(t)/dt    | (dF(t)/dt) / (1 - F(t)) 
R(t)   | 1 - R(t)        | -               | -dR(t)/dt   | -(dR(t)/dt) / R(t) 
f(t)   | ∫_0^t f(x) dx   | ∫_t^∞ f(x) dx   | -           | f(t) / ∫_t^∞ f(x) dx 
ρ(t)   | 1 - Y(t)        | Y(t)            | ρ(t) Y(t)   | - 

Renewal processes play an important role in many engineering areas (risk assessment, 
queueing systems, inventory management, spare parts, business¹⁹, etc.), and 
especially in reliability theory [4, 5, 12, 35, 55, 62]. The theory of renewal processes, 
when applied to a repairable system, permits the determination of the number of 
failures until a replacement occurs, in the interval of time (0, t). Mathematically, a 
renewal process is a sequence of independent identically distributed positive random 
variables, which are not all zero, with probability 1. At the initial time, the 
system is assumed to be in the operation mode. 


Definition 27 (Renewal process) Let $X_1, X_2, \ldots$ be independent identically 
distributed positive real-valued random variables, and define the partial sum 

$$S_n = X_1 + X_2 + \cdots + X_n.$$

Then the stochastic process 

$$\{S_1, S_2, \ldots, S_n\}$$

is said to be a renewal process. The number of renewals (replacements) in the 
interval of time (0, t) is denoted by 

$$N(t) = \max\{n : S_n \le t\}, \qquad t \ge 0.$$


In other words, it is clear that a renewal process is an arrival counting process. 
The developments that follow are made in the framework of reliability. 


When a failure occurs, let us assume that: 


1. the failed item is replaced, or repaired, in such a way that it will be as good as 
new; 

2. the repair time is negligible; 

3. the successive lifetimes are independent random variables distributed 
according to the probability density function f (t). 


From the definition of N(t), which represents the number of renewals when a failure 
occurs in the interval (0, t), we get 

$$\Pr(N(t) < n) = \Pr(S_n > t) = \Pr(X_1 + X_2 + \cdots + X_n > t).$$

It is well known that the probability density function of a sum of independent 
random variables is given by the convolution [5, 36]. It follows that 

$$f_{S_n} = \underbrace{f * f * \cdots * f}_{n \text{ times}} = f^{(n)}.$$



19 The customers arrive according to a renewal process. 

Let us denote by $F^{(n)}(t)$ the following probability: 

$$\Pr(S_n \le t) = \int_0^t f^{(n)}(x)\,dx = F_{S_n}(t) = F^{(n)}(t),$$

and we have that 

$$\Pr(N(t) < n) = \Pr(S_n > t) = 1 - F^{(n)}(t).$$

The probability of having n renewals in the interval of time (0, t) is given by 

$$
\begin{aligned}
\Pr(N(t) = n) &= \Pr(N(t) < n + 1) - \Pr(N(t) < n) \\
&= 1 - F^{(n+1)}(t) - \left(1 - F^{(n)}(t)\right) \\
&= F^{(n)}(t) - F^{(n+1)}(t).
\end{aligned}
$$
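
As an illustration of this formula, the sketch below builds the n-fold convolutions numerically for an exponential lifetime density and compares $F^{(n)}(t) - F^{(n+1)}(t)$ with the Poisson probabilities that are known to hold in that case. It assumes NumPy; the rate `lam`, the horizon `t_end`, the grid size and the helper names are illustrative choices, not part of the text.

```python
import math
import numpy as np

lam, t_end = 2.0, 1.0                       # assumed failure rate and horizon
t = np.linspace(0.0, t_end, 2001)
h = t[1] - t[0]
f = lam * np.exp(-lam * t)                  # exponential lifetime density

def convolve_with_f(g):
    """Trapezoidal approximation of (g * f)(t) on the grid."""
    full = np.convolve(g, f)[: t.size]
    return h * (full - 0.5 * (g[0] * f + g * f[0]))

def cdf_at_end(density):
    """Trapezoidal integral of a density over [0, t_end]."""
    return float(np.sum(0.5 * (density[1:] + density[:-1])) * h)

densities = [f]                             # densities[n - 1] approximates f^(n)
for _ in range(3):
    densities.append(convolve_with_f(densities[-1]))

F_n = [cdf_at_end(d) for d in densities]    # F^(1), ..., F^(4) evaluated at t_end
prob = [F_n[n - 1] - F_n[n] for n in (1, 2, 3)]
poisson = [math.exp(-lam * t_end) * (lam * t_end) ** n / math.factorial(n) for n in (1, 2, 3)]
print(prob, poisson)
```

For other lifetime laws the same convolution loop applies, but no closed-form reference such as the Poisson law is available in general.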


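Next, we shall introduce the renewal function, which is defined as the expected 
(mean) number of renewals, i.e., 

$$
\begin{aligned}
M(t) = E(N(t)) &= \sum_{n=1}^{\infty} n \Pr(N(t) = n) = \sum_{n=1}^{\infty} n \left[F^{(n)}(t) - F^{(n+1)}(t)\right] \\
&= \sum_{n=1}^{\infty} n F^{(n)}(t) - \sum_{n=1}^{\infty} \left[(n + 1) - 1\right] F^{(n+1)}(t) \\
&= \sum_{n=1}^{\infty} n F^{(n)}(t) - \sum_{n'=2}^{\infty} n' F^{(n')}(t) + \sum_{n=1}^{\infty} F^{(n+1)}(t) \\
&= F(t) + \sum_{n'=2}^{\infty} F^{(n')}(t) = \sum_{n=1}^{\infty} F^{(n)}(t).
\end{aligned}
$$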


The renewal function generated by F(t) can be derived on the basis of the following 
recursive form 

$$
\begin{aligned}
M(t) &= \sum_{n=1}^{\infty} F^{(n)}(t) = F(t) + \sum_{n=2}^{\infty} F^{(n)}(t) \\
&= F(t) + \sum_{n=2}^{\infty} \int_0^t F^{(n-1)}(t - x)\,dF(x) \\
&= F(t) + \int_0^t \sum_{n=1}^{\infty} F^{(n)}(t - x)\,dF(x) \\
&= F(t) + \int_0^t M(t - x)\,dF(x) \\
&= \int_0^t \left[1 + M(t - x)\right] dF(x).
\end{aligned} \tag{1.31}
$$
0 


Expression (1.31) corresponds to the fundamental renewal equation [5]. 
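The Laplace transform is commonly used to solve this equation. Indeed, the Laplace 
transform of the convolution f * g is given by 

$$\mathcal{L}[f * g] = \mathcal{L}[f]\,\mathcal{L}[g].$$

Applying it to (1.31), we get 

$$\mathcal{L}[M(t)] = \mathcal{L}\left[F(t) + \int_0^t M(t - x)\,dF(x)\right] = \mathcal{L}[F(t)] + \mathcal{L}[M(t)]\,\mathcal{L}[f(t)].$$

Recall that 

$$\mathcal{L}[F(t)] = \frac{\mathcal{L}[f(t)]}{s}.$$

It follows that 

$$\mathcal{L}[M(t)] = \frac{\mathcal{L}[f(t)]}{s\left(1 - \mathcal{L}[f(t)]\right)}.$$

The inverse transform of $\mathcal{L}[M(t)]$ gives M(t). However, in most cases a Laplace 
transform cannot be analytically inverted; therefore numerical methods have to be 
used [33]. 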


Example 29 Calculate the renewal function associated with the exponential law. 
Solution The exponential law is given by 

$$f(t) = \lambda \exp(-\lambda t).$$

Its Laplace transform²⁰ is equal to 

$$\mathcal{L}[\lambda \exp(-\lambda t)] = \frac{\lambda}{s + \lambda},$$

which leads to 

$$\mathcal{L}[M(t)] = \frac{\lambda}{s^2}$$

and consequently gives 

$$M(t) = \lambda t.$$
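
When the Laplace transform cannot be inverted analytically, the renewal equation (1.31) can also be solved directly on a time grid. The following is a minimal sketch of one such scheme (a trapezoidal Riemann–Stieltjes sum), assuming NumPy; the function name `renewal_function` and its arguments are hypothetical, and the scheme is only one of several possible discretizations. It is checked here against the exponential case, for which $M(t) = \lambda t$.

```python
import numpy as np

def renewal_function(cdf, t_max, n_steps):
    """Solve M(t) = F(t) + int_0^t M(t - x) dF(x) on a uniform grid (requires F(0) = 0)."""
    t = np.linspace(0.0, t_max, n_steps + 1)
    F = cdf(t)
    dF = np.diff(F)                      # dF[j - 1] = F(t_j) - F(t_{j-1})
    M = np.zeros(n_steps + 1)            # M(0) = 0
    for i in range(1, n_steps + 1):
        # Terms j = 2..i of the trapezoidal sum involve only already computed values of M.
        j = np.arange(2, i + 1)
        known = 0.5 * np.sum((M[i - j] + M[i - j + 1]) * dF[j - 1])
        # The j = 1 term, 0.5 * (M[i - 1] + M[i]) * dF[0], contains the unknown M[i].
        M[i] = (F[i] + known + 0.5 * M[i - 1] * dF[0]) / (1.0 - 0.5 * dF[0])
    return t, M

lam = 2.0                                # assumed failure rate
t, M = renewal_function(lambda x: 1.0 - np.exp(-lam * x), t_max=5.0, n_steps=500)
print(M[-1])                             # close to lam * 5 = 10
```

The cost grows quadratically with the number of grid points, which is acceptable for a quick check but not for fine grids over long horizons.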


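Remark 18 For an exponential process, the expected number of renewals in an interval 
of length $\Delta t$ is equal to 

$$M(t + \Delta t) - M(t) = \lambda \Delta t.$$

$\lambda$ represents the failure rate. Observe that 

$$\lim_{t \to \infty} \left[M(t + \Delta t) - M(t)\right] = \lambda \Delta t = \frac{\Delta t}{1/\lambda} = \frac{\Delta t}{\text{mean}}.$$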


Blackwell has proved that this result is general. He formulated this result in the 
well-known Blackwell Theorem [5, 22]. 


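The renewal function associated with the normal distribution (mean $\mu$, standard 
deviation $\sigma$) is given by 

$$M(t) = \sum_{n=1}^{\infty} \frac{1}{\sigma\sqrt{2\pi n}} \int_{-\infty}^{t} \exp\left[-\frac{1}{2}\left(\frac{x - n\mu}{\sigma\sqrt{n}}\right)^2\right] dx.$$

The average number of renewals occurring in the interval $[t_1, t_2]$ is given by 

$$N_{t_1, t_2} = M(t_2) - M(t_1).$$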


A renewal process is called ordinary if its inter-renewal times are strictly positive. 
Samuels [57] has proved that if the superposition of two ordinary renewal processes 
is an ordinary renewal process, then all processes are Poisson. This result has been 
generalized by Ferreria [24] to the case of processes whose inter-renewal times 
may be zero. He shows that, besides the Poisson processes, there are two pairs of 
binomial-like processes whose superposition is a renewal process. 


20 In general, if the Laplace transform is rational, i.e., is a ratio of polynomials, the 
inverse transform can be calculated by the Heaviside inversion formula, which consists of: 
i) calculating the roots of the denominator; ii) factoring the denominator; iii) finding the 
partial fraction expansion; and iv) inverting by using the Laplace transform table. 




Notice that the renewal density m(t) defined by 

$$m(t) = \frac{dM(t)}{dt}$$

satisfies an equation analogous to (1.31), i.e., 

$$m(t) = f(t) + \int_0^t m(t - x)\,dF(x).$$


The cost of corrective maintenance operations on units is expressed as a function 
of the renewal function. An example related to the preventive maintenance of systems 
consisting of non-identical units, with compulsory corrective repair of units upon 
failure, is given in [58]. The preventive maintenance on unit j is done every $k_j T$ 
time units. The derivation of the maintenance strategy leads to a formulation of 
a mixed integer programming problem involving the real argument T and the 
integers $k_j$. This problem can be solved using several techniques (see Chapter 3). 
A replacement model procedure for a series system of independent non-identical 
components was developed by Molina and Miller [44] on the basis of the average 
number of subsystem failures, and applied to the Greater Manchester Passenger 
Transport Executive bus data. 


A remark is in order here. 


Remark 19 The term "renewal" is also used in the framework of Markov chains 
(renewal decomposition) [8]. 


The concepts underlying martingale theory will be explored in what follows. 


1.5. Martingale, Supermartingale, Submartingale 


Stochastic processes such as martingales have extensive applications in stochastic 
problems. They arise naturally whenever one needs to consider mathematical 
expectations with respect to increasing information patterns. They are used to state 
several theoretical results concerning the convergence and the convergence rate of 
learning systems (derivation of the asymptotic properties of recursive algorithms). 
The martingale, supermartingale and submartingale processes are defined in 
what follows. 


Definition 28 (Martingale) A stochastic process $\{x_n\}$ is a martingale with respect 
to $\{\mathcal{F}_n\}$ if it is integrable, 

$$E\{|x_n|\} < \infty, \tag{1.32}$$

and for any n = 1, 2, ... 

$$E\{x_{n+1} \mid \mathcal{F}_n\} = x_n \quad \text{a.s.} \tag{1.33}$$


Definition 29 (Supermartingale) A stochastic process $\{x_n\}$ is a supermartingale 
if it is integrable and 

$$E\{x_{n+1} \mid \mathcal{F}_n\} \le x_n \quad \text{a.s.}$$


Definition 30 (Submartingale) A stochastic process $\{x_n\}$ is a submartingale if 
it is integrable and 

$$E\{x_{n+1} \mid \mathcal{F}_n\} \ge x_n \quad \text{a.s.} \tag{1.34}$$


The name “martingale” derives from a French term for the gambling strategy 
of doubling one’s bets until a win is secured. Let x, be the player’s fortune at stage 
n of a game. The martingale property captures one notion of a game being fair in 
that the player’s fortune on the next play is, on average, his current fortune and is 
not otherwise affected by the previous history. 


Remark 20 Note that {xn} is a supermartingale with respect to Fn if and only 
if {—xn} is a submartingale. Similarly, {xn} is a martingale with respect to Fn if 
and only if {xn} is both a submartingale and a supermartingale simultaneously. 


This means that statements about submartingales can be transcribed into equivalent 
statements concerning both supermartingales and martingales. 


At the present time, martingale theory has such a broad scope and diverse domain 
of application in general probability theory and mathematical analysis that to think 
of it purely in terms of gambling would be unduly restrictive and misleading. Some 
examples will be given next. 


1.5.1. Examples of Martingales 


The following examples are selected in order to demonstrate the immense variety 
and relevance of martingale processes. 


Example 30 (The POLYA urn) Let us show that the process $T_n$ defined in the 
POLYA urn example (see Example 3) is a martingale. 

$$E[T_n \mid T_{n-1}, \ldots, T_1] = E[T_n \mid T_{n-1}],$$

$$
\begin{aligned}
E\left[T_n \,\Big|\, T_{n-1} = \frac{b_{n-1}}{b_{n-1} + r_{n-1}}\right]
&= \frac{b_{n-1} + a}{b_{n-1} + a + r_{n-1}} \cdot \underbrace{\frac{b_{n-1}}{b_{n-1} + r_{n-1}}}_{\text{Probability}}
+ \frac{b_{n-1}}{b_{n-1} + r_{n-1} + a} \cdot \underbrace{\frac{r_{n-1}}{b_{n-1} + r_{n-1}}}_{\text{Probability}} \\
&= \frac{b_{n-1}}{b_{n-1} + r_{n-1}} = T_{n-1}.
\end{aligned}
$$

Thus, $\{T_n, n \ge 1\}$ is a martingale. 
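
The one-step computation above can be verified with exact rational arithmetic. The sketch below uses only the Python standard library; it assumes the usual urn mechanism in which `a` balls of the drawn colour are added, and the function and argument names (`expected_next_fraction`, `b`, `r`, `a`) are illustrative.

```python
from fractions import Fraction

def expected_next_fraction(b, r, a):
    """E[T_n | urn holds b balls of the first colour and r of the second], adding a balls per draw."""
    total = b + r
    p_first = Fraction(b, total)              # probability of drawing the first colour
    p_second = Fraction(r, total)             # probability of drawing the second colour
    after_first = Fraction(b + a, total + a)  # new fraction if the first colour was drawn
    after_second = Fraction(b, total + a)     # new fraction if the second colour was drawn
    return p_first * after_first + p_second * after_second

# The conditional expectation equals the current fraction for every tested composition.
checks = [expected_next_fraction(b, r, a) == Fraction(b, b + r)
          for b in range(1, 6) for r in range(1, 6) for a in range(1, 4)]
print(all(checks))
```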


The sums of independent random variables are useful in many areas and have been 
the subject of many studies. The main theoretical results concerning these sums are 
reviewed in [39]. 


Example 31 (Partial sum) Let $X_1, X_2, \ldots$ be independent random variables each 
with mean $\mu$. Let $S_0 = 0$ and for n > 0 let $S_n$ be the partial sum: 

$$S_n = X_1 + \cdots + X_n.$$

Show that 

$$M_n = S_n - n\mu$$

is a martingale with respect to $\mathcal{F}_n$, the information contained in $X_0, X_1, \ldots, X_n$. 
Solution Based on Example 13, we derive 

$$
\begin{aligned}
E\{M_{n+1} \mid \mathcal{F}_n\} &= E\{S_{n+1} - (n + 1)\mu \mid \mathcal{F}_n\} \\
&= E\{S_{n+1} \mid \mathcal{F}_n\} - (n + 1)\mu \\
&= (S_n + \mu) - (n + 1)\mu = M_n.
\end{aligned}
$$

In particular, if $\mu = 0$, then $S_n$ is a martingale with respect to $\mathcal{F}_n$. 
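
For a finite-valued step distribution the martingale property of $M_n = S_n - n\mu$ can be checked exactly by enumerating histories. The sketch below uses only the standard library; the three-point distribution (`values`, `probs`) is an arbitrary assumption made for illustration.

```python
from fractions import Fraction
from itertools import product

values = [Fraction(-1), Fraction(2), Fraction(5)]            # assumed step values
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]      # assumed step probabilities
mu = sum(p * v for p, v in zip(probs, values))                # common mean of the steps

def martingale_gap(history):
    """E[M_{n+1} | X_1..X_n = history] - M_n, computed exactly."""
    n = len(history)
    s_n = sum(history)
    m_n = s_n - n * mu
    cond_exp = sum(p * (s_n + v - (n + 1) * mu) for p, v in zip(probs, values))
    return cond_exp - m_n

# Every history of length 0..3 gives a zero gap, i.e. E[M_{n+1} | F_n] = M_n.
gaps = [martingale_gap(h) for n in range(4) for h in product(values, repeat=n)]
print(mu, set(gaps))
```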


Example 32 (Sums of independent random variables) Let $X_0 = 0$ and 
$X_1, X_2, \ldots$ be independent random variables with $E[|X_n|] < \infty$ and $E[X_n] = 0$ 
for all n. If $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$ for $n \ge 1$, then $\{S_n\}$ is a martingale 
with respect to $\{X_n\}$. 


Solution On property (1.32): 

$$E[|S_n|] \le E[|X_1|] + \cdots + E[|X_n|] < \infty.$$

Now we have to verify property (1.33). We have 

$$
\begin{aligned}
E[S_{n+1} \mid X_0, \ldots, X_n] &= E[S_n + X_{n+1} \mid X_0, \ldots, X_n] \\
&= E[S_n \mid X_0, \ldots, X_n] + E[X_{n+1} \mid X_0, \ldots, X_n].
\end{aligned} \tag{1.35}
$$

The independence assumption on $\{X_i\}$ leads to 

$$E[X_{n+1} \mid X_0, \ldots, X_n] = E[X_{n+1}].$$

Then, from (1.35) we derive 

$$E[S_{n+1} \mid X_0, \ldots, X_n] = E[S_n \mid X_0, \ldots, X_n] + E[X_{n+1}].$$

Since $E[X_n] = 0$, by stipulation, we get 

$$E[S_{n+1} \mid X_0, \ldots, X_n] = E[S_n \mid X_0, \ldots, X_n] = S_n.$$


The next example shows the connection between estimation and conditional 
mathematical expectation [60]. 


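Example 33 (Optimal estimator) Let $\xi$ and $\eta$ be two correlated random variables 
such that $\xi$ is observable, and denote by $\varphi(\xi)$ an estimator of $\eta$. Assume that 
$E\{\eta^2\} < \infty$. Then there is an optimal estimator of $\eta$, $\varphi^* = \varphi^*(\xi)$, and $\varphi^*(x)$ can 
be taken to be the function 

$$\varphi^*(x) = E\{\eta \mid \xi = x\}.$$


Solution Without loss of generality we may consider only estimators $\varphi(\xi)$ for 
which 

$$E\{\varphi^2(\xi)\} < \infty.$$

Then if $\varphi(\xi)$ is such an estimator, and 

$$\varphi^*(x) = E\{\eta \mid \xi = x\},$$

by adding and subtracting $\varphi^*(\xi)$ inside the square, we get 

$$
\begin{aligned}
E\left[(\eta - \varphi(\xi))^2\right] &= E\left[\left((\eta - \varphi^*(\xi)) + (\varphi^*(\xi) - \varphi(\xi))\right)^2\right] \\
&= E\left[(\eta - \varphi^*(\xi))^2\right] + E\left[(\varphi^*(\xi) - \varphi(\xi))^2\right] \\
&\quad + 2E\left\{(\eta - \varphi^*(\xi))(\varphi^*(\xi) - \varphi(\xi))\right\}.
\end{aligned} \tag{1.36}
$$

Taking into account the properties of conditional mathematical expectation, 
the double product on the right side of (1.36) leads to 

$$
\begin{aligned}
E\left\{(\eta - \varphi^*(\xi))(\varphi^*(\xi) - \varphi(\xi))\right\} &= E\left\{E\left[(\eta - \varphi^*(\xi))(\varphi^*(\xi) - \varphi(\xi)) \mid \xi\right]\right\} \\
&= E\left\{(\varphi^*(\xi) - \varphi(\xi))\, E\left[\eta - \varphi^*(\xi) \mid \xi\right]\right\} \\
&= 0.
\end{aligned}
$$

Since 

$$E\left[(\varphi^*(\xi) - \varphi(\xi))^2\right] \ge 0,$$

we derive 

$$E\left[(\eta - \varphi(\xi))^2\right] \ge E\left[(\eta - \varphi^*(\xi))^2\right].$$

This inequality shows that any other estimator leads to a mean-square prediction 
error at least as large as the one associated with the optimal estimator. 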


Remark 21 The previous example corresponds to Theorem 1, Chapter 2, 
Section 8, in [60]. 


Next, we shall consider a more general sum and its connection with martingales. 
Example 34 (More general sums) Suppose $Z_i = g_i(Y_0, \ldots, Y_i)$ for some 
arbitrary sequences of random variables $Y_i$ and functions $g_i$. Let f be a function 
for which 

$$E[|f(Z_k)|] < \infty, \qquad \text{for } k = 0, 1, \ldots.$$

Let $a_k$ be a bounded function of k real variables. Then 

$$X_n = \sum_{k=0}^{n} \left\{f(Z_k) - E[f(Z_k) \mid Y_0, \ldots, Y_{k-1}]\right\} a_k(Y_0, \ldots, Y_{k-1})$$

defines a martingale with respect to $\{Y_n\}$. 
Solution By convention, 

$$E[f(Z_k) \mid Y_0, \ldots, Y_{k-1}] = E[f(Z_k)]$$

when k = 0. Since $a_k$ is bounded, say 

$$|a_k(y_0, \ldots, y_{k-1})| \le A_k, \qquad \text{for all } y_0, \ldots, y_{k-1},$$

we have the property (1.32): 

$$E[|X_n|] \le 2 \sum_{k=0}^{n} A_k\, E[|f(Z_k)|] < \infty.$$

Let 

$$B_k = \left\{f(Z_k) - E[f(Z_k) \mid Y_0, \ldots, Y_{k-1}]\right\} a_k(Y_0, \ldots, Y_{k-1}).$$

Let us recall that $Z_k = g_k(Y_0, \ldots, Y_k)$ and 

$$E[g(Y_0, \ldots, Y_k) \mid Y_0, \ldots, Y_k] = g(Y_0, \ldots, Y_k).$$

It follows that 

$$E[B_k \mid Y_0, \ldots, Y_{k-1}] = 0.$$

Thus, 

$$
\begin{aligned}
E[X_n \mid Y_0, \ldots, Y_{n-1}] &= E\left[\sum_{k=0}^{n} B_k \,\Big|\, Y_0, \ldots, Y_{n-1}\right] \\
&= E[X_{n-1} \mid Y_0, \ldots, Y_{n-1}] + E[B_n \mid Y_0, \ldots, Y_{n-1}] \\
&= X_{n-1},
\end{aligned}
$$


which establishes the martingale property. 


In what follows, we shall consider a Markov process [35]. 
Example 35 Let $Y_0, Y_1, \ldots$ be a Markov chain process with transition probability 
matrix $\Pi_{N \times N} = [p_{ij}]$. Next, let f be a bounded nonnegative sequence such that 

$$f(i) = \sum_{j=1}^{N} p_{ij} f(j).$$

Then, the process $X_n$ 

$$X_n = f(Y_n)$$

is a martingale and 

$$E\{|X_n|\} < \infty.$$


Solution The process $X_1, X_2, \ldots$ is integrable, i.e., 

$$E\{|X_n|\} < \infty,$$

since f is bounded. The conditional mathematical expectation of $X_{n+1}$ is given by 

$$E\{X_{n+1} \mid Y_0, Y_1, \ldots, Y_n\} = E\{f(Y_{n+1}) \mid Y_0, Y_1, \ldots, Y_n\}.$$

In view of the Markov property, we derive 

$$
\begin{aligned}
E\{X_{n+1} \mid Y_0, Y_1, \ldots, Y_n\} &= E\{f(Y_{n+1}) \mid Y_n\} \\
&= \sum_{j=1}^{N} p_{Y_n j}\, f(j) = f(Y_n) = X_n,
\end{aligned}
$$

which shows the induced martingale. 
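
A concrete instance of such a function f can be checked numerically. The sketch below assumes NumPy and uses a fair gambler's-ruin chain on five states with absorbing barriers, which is an illustrative choice not taken from the text; for it, $f(i) = i/(N-1)$ satisfies the condition $f = \Pi f$, and the mean of $f(Y_n)$ stays constant in n as expected for a martingale.

```python
import numpy as np

N = 5                                        # assumed number of states 0, ..., 4
P = np.zeros((N, N))
P[0, 0] = P[N - 1, N - 1] = 1.0              # absorbing barriers
for i in range(1, N - 1):
    P[i, i - 1] = P[i, i + 1] = 0.5          # fair one-step moves

f = np.arange(N) / (N - 1)                   # candidate function f(i) = i / (N - 1)
is_harmonic = np.allclose(P @ f, f)          # f(i) = sum_j p_ij f(j)

# Starting from state 2, E[f(Y_n)] should not depend on n.
start = np.zeros(N)
start[2] = 1.0
means = [float(start @ np.linalg.matrix_power(P, n) @ f) for n in range(6)]
print(is_harmonic, means)
```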
The next example deals with the variance of a sum. 
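Example 36 (Variance of a sum as a martingale) Let $Y_0 = 0$ and $Y_1, Y_2, \ldots$ 
be independent identically distributed (iid) random variables with $E[Y_k] = 0$ and 
$E[Y_k^2] = \sigma^2$, k = 1, 2, .... Let $X_0 = 0$ and show that 

$$X_n = \left(\sum_{k=1}^{n} Y_k\right)^2 - n\sigma^2$$

defines a martingale with respect to $\{Y_n\}$. 
Solution From the property 

$$|a + b| \le |a| + |b|$$

of the operator $|\cdot|$, we derive that $X_n$ is integrable: 

$$E[|X_n|] \le 2n\sigma^2.$$

Now, let us calculate the following conditional mathematical expectation: 

$$
\begin{aligned}
E[X_{n+1} \mid Y_0, \ldots, Y_n] &= E\left[\left(\sum_{k=1}^{n+1} Y_k\right)^2 - (n + 1)\sigma^2 \,\Big|\, Y_0, \ldots, Y_n\right] \\
&= E\left[Y_{n+1}^2 + 2Y_{n+1} \sum_{k=1}^{n} Y_k + \left(\sum_{k=1}^{n} Y_k\right)^2 - (n + 1)\sigma^2 \,\Big|\, Y_0, \ldots, Y_n\right] \\
&= X_n + E[Y_{n+1}^2 \mid Y_0, \ldots, Y_n] - \sigma^2 + 2E[Y_{n+1} \mid Y_0, \ldots, Y_n] \sum_{k=1}^{n} Y_k = X_n.
\end{aligned}
$$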
k=l 


In the next example we will show that any centered random variable can be 
considered as a sum of martingale differences. 

Example 37 Any centered random variable $X_n$ ($E[X_n] = 0$) is a sum of martingale 
differences $X_{n,k}$, i.e., 

$$X_n = \sum_{k=1}^{n} X_{n,k}, \qquad X_{n,k} = E[X_n \mid \mathcal{F}_k] - E[X_n \mid \mathcal{F}_{k-1}].$$


Solution It is evident if we note that 

$$E[X_n \mid \mathcal{F}_0] = E[X_n] = 0 \qquad \text{and} \qquad E[X_{n,k} \mid \mathcal{F}_{k-1}] = 0.$$


Now we shall introduce Doob's martingale process. We will show that under some 
conditions the conditional mathematical expectation of a given random variable 
constitutes a martingale. 


Example 38 (Doob's martingale process) Let $Y_0, Y_1, \ldots$ be an arbitrary 
sequence of random variables and suppose X is a random variable satisfying 
$E[|X|] < \infty$. Then 

$$X_n = E[X \mid Y_0, \ldots, Y_n]$$

forms a martingale with respect to $\{Y_n\}$, called Doob's process. 


Solution Let us first show that condition (1.32), 

$$E[|X_n|] < \infty,$$

is fulfilled: 

$$
\begin{aligned}
E[|X_n|] &= E\left\{\left|E[X \mid Y_0, \ldots, Y_n]\right|\right\} \\
&\le E\left\{E[|X| \mid Y_0, \ldots, Y_n]\right\} \\
&= E[|X|] < \infty.
\end{aligned}
$$

Now we shall deal with the second condition (1.33). By the law of total probability 
for conditional mathematical expectations²¹, 

$$
\begin{aligned}
E[X_{n+1} \mid Y_0, \ldots, Y_n] &= E\left\{E[X \mid Y_0, \ldots, Y_n, Y_{n+1}] \mid Y_0, \ldots, Y_n\right\} \\
&= E[X \mid Y_0, \ldots, Y_n] = X_n,
\end{aligned}
$$

which completes the proof. 


21 The law of total probability for conditional mathematical expectations extends the usual 
law by introducing further conditioning on a random variable Z. The law states that 

$$E[X \mid Z] = E\{E[X \mid Y, Z] \mid Z\},$$

valid whenever $E[|X|] < \infty$. 
In the next example, we shall construct a submartingale from a martingale by using 
convex functions. 


Example 39 (Convex functions) Let $\{X_n\}$ be a martingale with respect to $\{Y_n\}$. 
If $\phi$ is a convex function for which 

$$E\left[\phi(X_n)^+\right] < \infty$$

(see footnote 22) for all n, then $\{\phi(X_n)\}$ is a submartingale with respect to $\{Y_n\}$. In 
particular, $\{|X_n|\}$ is always a submartingale and $\{|X_n|^2\}$ is a submartingale whenever 
$E[X_n^2] < \infty$ for all n. 
Solution We need only show the submartingale inequality; the other properties 
can be easily proved. By making use of the Jensen inequality (see Appendix A), 
we derive 

$$
\begin{aligned}
E[\phi(X_{n+1}) \mid Y_0, \ldots, Y_n] &\ge \phi\left(E[X_{n+1} \mid Y_0, \ldots, Y_n]\right) \\
&= \phi(X_n).
\end{aligned}
$$

The decomposition of a submartingale into a martingale and a positive predictable 
increasing sequence is stated in what follows [16]. 


Example 40 (Doob decomposition) $\{X_n\}$ is a submartingale with respect to $\{\mathcal{F}_n\}$ 
if and only if there exists a martingale $\{X'_n\}$ with respect to $\{\mathcal{F}_n\}$ and a positive 
increasing predictable sequence $\{Z_n\}$ such that 

$$X_n = Z_n + X'_n.$$


Solution Let $\{X_n\}$ be a submartingale and 

$$Y_{k+1} = X_{k+1} - E[X_{k+1} \mid \mathcal{F}_k], \qquad Y_1 = X_1 \tag{1.37}$$

and 

$$V_{k+1} = -X_k + E[X_{k+1} \mid \mathcal{F}_k], \qquad V_1 = 0, \tag{1.38}$$

then (1.37) leads to 

$$E[Y_{k+1} \mid \mathcal{F}_k] = 0$$

and $V_{k+1} \ge 0$ since $\{X_n\}$ is a submartingale (see (1.34)). Then 

$$X_n = X_1 + \sum_{k=1}^{n-1} (X_{k+1} - X_k) = Y_1 + \sum_{k=2}^{n} Y_k + \sum_{k=1}^{n} V_k = Z_n + X'_n,$$


22 The notations $x^+$ and $x^-$ are defined as follows: 

$$x^+ = \max(0, x) \qquad \text{and} \qquad x^- = \min(x, 0).$$
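where 

$$X'_n = \sum_{k=1}^{n} Y_k \qquad \text{and} \qquad Z_n = \sum_{k=1}^{n} V_k.$$

Now 

$$E[X'_{n+1} \mid \mathcal{F}_n] = E\left[Y_{n+1} + \sum_{k=1}^{n} Y_k \,\Big|\, \mathcal{F}_n\right] = \sum_{k=1}^{n} Y_k = X'_n.$$

Hence $\{X'_n\}$ is a martingale and $Z_n$ is obviously increasing. To show the converse, 
we simply note that 

$$E[X_{n+1} \mid \mathcal{F}_n] = E[Z_{n+1} + X'_{n+1} \mid \mathcal{F}_n] = Z_{n+1} + X'_n \ge Z_n + X'_n = X_n$$

since $\{Z_n\}$ is increasing and predictable. 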


We shall end this subsection by the following lemma, which is useful for the analysis 
of some problems related to estimation problems [11, 63]. 


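Lemma 1 Let $U_n$ be a supermartingale with respect to the $\sigma$-algebra $\mathcal{F}_n$ with 
$E(U_1) = 0$, and $U_0 = 0$. 
Let 

$$x_i = U_i - U_{i-1}, \qquad i \ge 1.$$

Assume that 

$$x_i \le c \quad \text{for some } 0 < c < \infty, \quad \forall i \ge 1. \tag{1.39}$$

Then 

$$T_n = \exp\left(\lambda \sum_{i=1}^{n} x_i\right) \exp\left[-\left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) \sum_{i=1}^{n} E\left\{x_i^2 \mid \mathcal{F}_{i-1}\right\}\right],$$

where 

$$\lambda > 0 \text{ is such that } \lambda c \le 1,$$

is a supermartingale with respect to $\mathcal{F}_n$. 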
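Proof. Let us consider the Taylor expansion of exp(x): 

$$\exp(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!} + \cdots$$

For x = 1, we obtain 

$$e = 2 + \frac{1}{2!} + \frac{1}{3!} + \cdots = \lim_{n \to \infty} \sum_{k=0}^{n} \frac{1}{k!}. \tag{1.40}$$

For $x = \lambda x_j$, we get the following 

$$
\begin{aligned}
\exp(\lambda x_j) &= 1 + \lambda x_j + \frac{\lambda^2 x_j^2}{2!} + \cdots + \frac{\lambda^n x_j^n}{n!} + \cdots \\
&= 1 + \lambda x_j + \frac{\lambda^2 x_j^2}{2}\left[1 + \frac{\lambda x_j}{3} + \frac{\lambda^2 x_j^2}{12} + \cdots + \frac{2\lambda^{n-2} x_j^{n-2}}{n!} + \cdots\right].
\end{aligned}
$$

From the fact that $0 < \lambda c \le 1$, (1.40), and (1.39), we obtain the following 
estimation: 

$$
\begin{aligned}
\exp(\lambda x_j) &\le 1 + \lambda x_j + \frac{\lambda^2 x_j^2}{2}\left[1 + \lambda c\left(\frac{2}{3!} + \frac{2}{4!} + \cdots + \frac{2}{n!} + \cdots\right)\right] \\
&\le 1 + \lambda x_j + \frac{\lambda^2 x_j^2}{2}\left(1 + \frac{\lambda c}{2}\right),
\end{aligned}
$$

since, by (1.40), $2\left(\frac{1}{3!} + \frac{1}{4!} + \cdots\right) = 2\left(e - \frac{5}{2}\right) \approx 0.44 \le \frac{1}{2}$. 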


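Taking the conditional mathematical expectation of both sides of this inequality, 
and considering the definition of $x_j = U_j - U_{j-1}$ and the fact that $U_n$ is 
a supermartingale with respect to the $\sigma$-algebra $\mathcal{F}_n$, i.e., $E\{U_j \mid \mathcal{F}_{j-1}\} \le U_{j-1}$ a.s., 
we obtain 

$$E\{\exp(\lambda x_j) \mid \mathcal{F}_{j-1}\} \le 1 + \left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) E\left\{x_j^2 \mid \mathcal{F}_{j-1}\right\} \quad \text{a.s.}$$

In other words, the conditional mathematical expectation of the term $\lambda x_j$ is 
nonpositive and plays no role. Taking into account that the right-hand side of this 
inequality can be considered as the first two terms of the Taylor expansion of 

$$\exp\left[\left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) E\left\{x_j^2 \mid \mathcal{F}_{j-1}\right\}\right],$$

we obtain the following estimation: 

$$E\{\exp(\lambda x_j) \mid \mathcal{F}_{j-1}\} \le \exp\left[\left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) E\left\{x_j^2 \mid \mathcal{F}_{j-1}\right\}\right] \quad \text{a.s.} \tag{1.41}$$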
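For any $n \ge 2$, in view of the definition of $x_i = U_i - U_{i-1}$, we obtain 

$$U_n = \sum_{i=1}^{n} x_i = \sum_{i=1}^{n-1} x_i + x_n = U_{n-1} + x_n,$$

which leads to 

$$
\begin{aligned}
E\{T_n \mid \mathcal{F}_{n-1}\} = {} & \exp\left(\lambda \sum_{i=1}^{n-1} x_i\right) \\
& \times \exp\left[-\left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) \sum_{i=1}^{n} E\left\{x_i^2 \mid \mathcal{F}_{i-1}\right\}\right] \\
& \times E\{\exp(\lambda x_n) \mid \mathcal{F}_{n-1}\}.
\end{aligned} \tag{1.42}
$$

In view of inequality (1.41) for j = n, we get 

$$
\begin{aligned}
E\{T_n \mid \mathcal{F}_{n-1}\} \le {} & \exp\left(\lambda \sum_{i=1}^{n-1} x_i\right) \times \exp\left[-\left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) \sum_{i=1}^{n} E\left\{x_i^2 \mid \mathcal{F}_{i-1}\right\}\right] \\
& \times \exp\left[\left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) E\left\{x_n^2 \mid \mathcal{F}_{n-1}\right\}\right] \quad \text{a.s.}
\end{aligned}
$$

By regrouping the second and the third terms of the right-hand side of this inequality, 
we obtain 

$$E\{T_n \mid \mathcal{F}_{n-1}\} \le \exp\left(\lambda \sum_{i=1}^{n-1} x_i\right) \exp\left[-\left(\frac{\lambda^2}{2}\right)\left(1 + \frac{\lambda c}{2}\right) \sum_{i=1}^{n-1} E\left\{x_i^2 \mid \mathcal{F}_{i-1}\right\}\right] \quad \text{a.s.}$$

By definition, the right-hand side of this inequality is exactly $T_{n-1}$, i.e., 

$$E\{T_n \mid \mathcal{F}_{n-1}\} \le T_{n-1} \quad \text{a.s.}$$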

and, consequently we obtain the desired result. 
Before concluding this subsection, let us notice that: 


1. many martingales can be derived in association with Markov chains [35]; 

2. the driving noises, which are used in process modelling for characterizing 
the uncertainties and the perturbations acting on the process considered, are 
modeled by martingale difference sequences with respect to a nondecreasing 
sequence of σ-algebras consisting of the measurements (input-output 
data) [11]; 

3. the partial sums of martingales arise in the context of stochastic adaptive control 
strategies (model reference adaptive control and self-tuning controllers) [11]. 


1.5.2. Martingale Convergence Theorems 
We shall present some martingale convergence theorems. 


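Theorem 2 (Optional skipping theorem) Let $\{X_n\}$ be a submartingale with 
respect to $\{\mathcal{F}_n\}$. Let $\varepsilon_1, \varepsilon_2, \ldots$ be random variables defined by 

$$
\varepsilon_k = \begin{cases}
1 & \text{if } (X_1, \ldots, X_k) \in B_k, \\
0 & \text{if } (X_1, \ldots, X_k) \notin B_k,
\end{cases}
$$

where the $B_k$ are arbitrary Borel sets in $\mathbb{R}^k$. Set 

$$
\begin{aligned}
Y_1 &= X_1, \\
Y_2 &= X_1 + \varepsilon_1 (X_2 - X_1), \\
&\;\;\vdots \\
Y_n &= X_1 + \varepsilon_1 (X_2 - X_1) + \cdots + \varepsilon_{n-1} (X_n - X_{n-1}).
\end{aligned}
$$

Then $\{Y_n\}$ is also a submartingale with respect to $\{\mathcal{F}_n\}$ and $E[Y_n] \le E[X_n]$ for all 
n. If $\{X_n\}$ is a martingale with respect to $\{\mathcal{F}_n\}$, then $\{Y_n\}$ is also a martingale with 
respect to $\{\mathcal{F}_n\}$ and $E[Y_n] = E[X_n]$ for all n. 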


Remark 22 The above can be interpreted as follows: Let $X_n$ be the gambler's 
fortune after n trials; then $Y_n$ is our fortune if we follow an optional skipping strategy. 
After observing $X_1, \ldots, X_k$, we may choose to bet with the gambler at trial k + 1 
[in this case $\varepsilon_k = \varepsilon_k(X_1, \ldots, X_k) = 1$] or we may pass [$\varepsilon_k = 0$]. Our gain on 
trial k + 1 is $\varepsilon_k (X_{k+1} - X_k)$. The theorem states that whatever strategy we employ, 
if the game is initially "fair" (a martingale) or "favorable" (a submartingale), 
it remains fair (or favorable), and no strategy of this type can increase the expected 
winning. 

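Proof We get 

$$
\begin{aligned}
E[Y_{n+1} \mid \mathcal{F}_n] &= E[Y_n + \varepsilon_n (X_{n+1} - X_n) \mid \mathcal{F}_n] \\
&= Y_n + \varepsilon_n E[X_{n+1} - X_n \mid \mathcal{F}_n].
\end{aligned}
$$

Since $\varepsilon_n$ is a Borel-measurable function of $X_1, \ldots, X_n$, it is 
$\mathcal{F}(X_1, \ldots, X_n) \subset \mathcal{F}_n$-measurable. 
Therefore, 

$$
\begin{aligned}
E[Y_{n+1} \mid \mathcal{F}_n] &= Y_n + \varepsilon_n (X_n - X_n) = Y_n \quad \text{in the martingale case,} \\
E[Y_{n+1} \mid \mathcal{F}_n] &\ge Y_n + \varepsilon_n (X_n - X_n) = Y_n \quad \text{in the submartingale case.}
\end{aligned}
$$

Since $Y_1 = X_1$, we have $E[X_1] = E[Y_1]$. We proceed by induction: suppose we have shown 
$E[X_k - Y_k] \ge 0$ (= 0 in the martingale case). Then 

$$
\begin{aligned}
X_{k+1} - Y_{k+1} &= X_{k+1} - Y_k - \varepsilon_k (X_{k+1} - X_k) \\
&= (1 - \varepsilon_k)(X_{k+1} - X_k) + X_k - Y_k.
\end{aligned}
$$

Thus, 

$$
\begin{aligned}
E[X_{k+1} - Y_{k+1} \mid \mathcal{F}_k] &= (1 - \varepsilon_k) E[X_{k+1} - X_k \mid \mathcal{F}_k] + E[X_k - Y_k \mid \mathcal{F}_k] \\
&\ge E[X_k - Y_k \mid \mathcal{F}_k] = X_k - Y_k
\end{aligned}
$$

with equality in the martingale case. Take mathematical expectations and use 

$$E[E(X \mid \mathcal{F})] = E(X)$$

to obtain 

$$E[X_{k+1} - Y_{k+1}] \ge E[X_k - Y_k] \ge 0$$

with equality in the martingale case. 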
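
The statement of the theorem can be checked exactly on a small example by enumerating all paths. The sketch below uses only the standard library; it assumes a ±1 random walk for $X_n$ (fair when `p_up` is 1/2, favorable when it is larger) and a hypothetical skipping rule, `bet_when_behind`, which bets on the next trial only when the current fortune is negative.

```python
from fractions import Fraction
from itertools import product

def expectations(p_up, n, bet):
    """Return (E[X_n], E[Y_n]) for a +/-1 random walk X and the skipping rule `bet`."""
    e_x = Fraction(0)
    e_y = Fraction(0)
    for steps in product((1, -1), repeat=n):
        weight = Fraction(1)
        for s in steps:
            weight *= p_up if s == 1 else 1 - p_up
        x = [sum(steps[: k + 1]) for k in range(n)]           # X_1, ..., X_n
        # Y_n = X_1 + sum_k eps_k (X_{k+1} - X_k), with eps_k a function of X_1..X_k only.
        y = x[0] + sum(bet(x[: k + 1]) * (x[k + 1] - x[k]) for k in range(n - 1))
        e_x += weight * x[-1]
        e_y += weight * y
    return e_x, e_y

def bet_when_behind(past):
    """Illustrative rule: bet on the next trial only if the current fortune is negative."""
    return 1 if past[-1] < 0 else 0

fair = expectations(Fraction(1, 2), 6, bet_when_behind)        # martingale case
favorable = expectations(Fraction(3, 5), 6, bet_when_behind)   # submartingale case
print(fair, favorable)
```

In the fair case both expectations coincide, and in the favorable case skipping trials can only lower the expected fortune, in line with the theorem.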


In what follows, we present a convergence theorem that can be interpreted as an 
analog of the fact that in real analysis a bounded monotonic sequence of numbers 
has a (finite) limit [60]. This theorem is the most outstanding strong convergence 
theorem [10]. 


Theorem 3 (Martingale convergence theorem) (see [16, 42, 50]) Let $X_n$ be 
a submartingale with respect to $\{\mathcal{F}_n\}$. Next, assume that 

$$\sup_n E(|X_n|) < \infty.$$

Then with probability 1, the limit 

$$\lim_{n \to \infty} X_n = X_\infty$$

exists and 

$$E(|X_\infty|) < \infty.$$


23 Some authors use the following notation: 
$$\overline{\lim}\, f_n$$
instead of 
$$\lim_{n \to \infty} \sup_{k \ge n} f_k.$$

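Lemma 2 (Pythagorean relation) For $S_n = \sum_{i=1}^{n} X_i$, where the $X_i$ are orthogonal, 
it holds that 

$$E\left[(S_n - S_m)^2\right] = \sum_{i=m+1}^{n} E\left[X_i^2\right]$$

for all m < n. 
Proof 

$$(S_n - S_m)^2 = \sum_{i=m+1}^{n} X_i^2 + 2 \sum_{m+1 \le i < j \le n} X_i X_j. \tag{1.43}$$

The desired result follows by taking mathematical expectations in (1.43), using the 
definition of orthogonality ($E[X_i X_j] = 0$ for $i \ne j$). 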


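Lemma 3 Suppose $\sum_{i=1}^{\infty} E\{X_i^2\} < \infty$. Then there exists a random variable 
S such that 

$$E\{S^2\} < \infty \qquad \text{and} \qquad E\{(S_n - S)^2\} \to 0 \text{ as } n \to \infty,$$

where $\sum_{i=1}^{\infty} X_i$ is denoted by S. 


Proof Let m < n be positive integers. Then 

$$E\left[(S_n - S_m)^2\right] = \sum_{i=m+1}^{n} E\left[X_i^2\right] \le \sum_{i=m+1}^{\infty} E\left[X_i^2\right] \to 0 \quad \text{as } m \to \infty$$

by the Pythagorean relation and by hypothesis. Hence $\{S_n, n \ge 1\}$ is a Cauchy 
sequence²⁴ with respect to the $L^2$ norm. Hence, by completeness of $L^2$ there exists 
a random variable S with $E\{S^2\} < \infty$ and $E\{(S_n - S)^2\} \to 0$. 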


The next lemma is due to Neveu [50]. It concerns martingale difference sequences. 


24 A sequence $\{X_n\}$ is said to be a Cauchy sequence (fundamental or self-convergent) if

$$d(X_n, X_m) \to 0 \quad \text{as } n, m \to \infty,$$

where $d(\cdot, \cdot)$ represents the distance between two elements in the corresponding metric space.
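Lemma 4 (Martingale difference sequence) (Neveu [50]) Let $X(t)$ be a
martingale difference sequence with respect to the $\sigma$-algebra $\mathcal{F}_t$, that is,
such that $E\{X(t) \mid \mathcal{F}_{t-1}\} = 0$. If

$$E\left\{X(t)^2 \mid \mathcal{F}_{t-1}\right\} = \sigma^2$$

and

$$E\left\{X(t)^4 \mid \mathcal{F}_{t-1}\right\} \le k < \infty,$$

then

$$\lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} X(t)^2 = \sigma^2 \quad \text{a.s.}$$


Proof Consider

$$Y(t) = X(t)^2 - E\left\{X(t)^2 \mid \mathcal{F}_{t-1}\right\} = X(t)^2 - \sigma^2.$$

Then

$$E\left\{X(t)^2 - \sigma^2 \mid \mathcal{F}_{t-1}\right\} = 0$$

(i.e., $Y(t)$ has zero conditional mean). Also,

$$\sum_{t=1}^{\infty} \frac{1}{t^2} E\left\{[X(t)^2 - \sigma^2]^2 \mid \mathcal{F}_{t-1}\right\} = \sum_{t=1}^{\infty} \frac{1}{t^2} \left(E\left\{X(t)^4 \mid \mathcal{F}_{t-1}\right\} - 2\sigma^4 + \sigma^4\right)$$
$$= \sum_{t=1}^{\infty} \frac{1}{t^2} \left(E\left\{X(t)^4 \mid \mathcal{F}_{t-1}\right\} - \sigma^4\right) \le \sum_{t=1}^{\infty} \frac{1}{t^2} (k - \sigma^4) < \infty.$$


The result follows from the Neveu Lemma ([50], see the appendix).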


The next definition is related to the Markov time. 


Definition 31 (Markov time) A random variable $T(\omega)$ that takes values in the set
$\{0, 1, \dots, +\infty\}$ is a Markov time with respect to a $\sigma$-algebra $\mathcal{F}_n$, a random variable
independent of the future, if for each $n \ge 0$

$$\{T = n\} \in \mathcal{F}_n.$$

In other words, the indicator function of the event $\{T = n\}$ can be written as a
function of $\mathcal{F}_n$ (random variable independent of the future). Some authors [63]
use the term "stopping rule" instead of Markov time.

The stopping rule is intimately connected with games. Chow et al. [13] state: 
Optimal stopping problems concern the effect on a gambler's fortune of various possible
systems for deciding when to stop playing a sequence of games [...] where the experi- 
menter must constantly ask whether the increase in information contained in further data 
will outweigh the cost of collecting it. 


It is easy to show that the Markov time exhibits the following properties: 


1. if $T$ and $\tau$ are Markov times, then their sum is also a Markov time;
2. $\min(T, \tau)$ is also a Markov time.


Before ending this chapter, let us give some useful definitions. 


1.6. Some Definitions 
Definition 32 (Supremum) Let $A$ be a nonempty set. We define

$$M = \sup\{x : x \in A\} = \sup_{x \in A} x$$

as the unique element $M$ such that

$$M \ge x, \quad x \in A.$$

If

$$M^* \ge x, \quad x \in A,$$

then
$M^* \ge M$.
In words, $M$ is the least upper bound of the elements of $A$.
Definition 33 (Infimum) In an analogous manner, we define

$$m = \inf\{x : x \in A\} = \inf_{x \in A} x$$

as the unique element $m$ such that

$$m \le x, \quad x \in A.$$

If

$$m^* \le x, \quad x \in A,$$

then
$m^* \le m$.
Put into words, $m$ is the greatest lower bound of the elements of $A$.
Definition 34 Let us consider a sequence of real numbers $\{x_n\}$. Then

$$\lim_{n \to \infty} x_n = x^*$$

if for every $\varepsilon > 0$, there exists an integer $N = N(\varepsilon)$ such that

$$|x_n - x^*| < \varepsilon$$

whenever $n > N$.




The reader should not conclude that, if the index $n$ represents the time, the value
of the sequence $\{x_n\}$ gets steadily closer to $x^*$ as $n$ gets larger and larger [32].
Indeed, let us consider the following sequence:

$$x_{2n} = 0 \quad \text{and} \quad x_{2n-1} = \frac{1}{2n - 1}, \quad n = 1, 2, \dots$$

This sequence tends to zero, but

$$|x_{2n} - 0| < |x_{2n+1} - 0|.$$


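Definition 35 (Limit of a sequence) Let $\{x_n\}$ be a sequence, and

$$\overline{\lim_{n \to \infty}}\, x_n = \limsup_{n \to \infty} x_n = \inf_{n \ge 0} \sup_{k \ge n} x_k,$$
$$\underline{\lim}_{n \to \infty}\, x_n = \liminf_{n \to \infty} x_n = \sup_{n \ge 0} \left( \inf_{k \ge n} x_k \right).$$

If

$$M_n = \sup_{k \ge n} x_k,$$

it follows that

$$M_n \ge M_{n+1}, \quad n \ge 0,$$

and

$$\limsup_{n \to \infty} x_n = \lim_{n \to \infty} M_n = \lim_{n \to \infty} \left( \sup_{k \ge n} x_k \right).$$

In an analogous manner, if $m_n = \inf_{k \ge n} x_k$, it follows that $m_n \le m_{n+1}$, $n \ge 0$, and

$$\liminf_{n \to \infty} x_n = \lim_{n \to \infty} m_n = \lim_{n \to \infty} \left( \inf_{k \ge n} x_k \right).$$

Moreover, one can see that

$$-\infty \le \liminf_{n \to \infty} x_n \le \limsup_{n \to \infty} x_n \le \infty$$

and $\lim_{n \to \infty} x_n$ exists if and only if

$$l = \liminf_{n \to \infty} x_n = \limsup_{n \to \infty} x_n.$$

We get

$$l = \lim_{n \to \infty} x_n,$$

which is usually written as follows:

$$\limsup_{n \to \infty} x_n = \liminf_{n \to \infty} x_n = \lim_{n \to \infty} x_n.$$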


In words, when the smallest cluster point ($\liminf_{n \to \infty} x_n$) is equal to the largest
cluster point ($\limsup_{n \to \infty} x_n$), this unique cluster point is the limit of the considered
sequence.


Definition 36 The events $A_n$ are said to occur infinitely often (i.o.) if

$$\limsup_{n \to \infty} A_n = \bigcap_{k=1}^{\infty} \bigcup_{n=k}^{\infty} A_n = \{A_n, \text{ i.o.}\}.$$
Definition 37 (Little-o)

$$o(x)$$

means that

$$\frac{o(x)}{x} \to 0$$

when $x \to 0$.
Example 41 The Taylor expansions of the functions $\exp(x)$ and $\sin x$ are given by

$$\exp(x) = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + x^n \varepsilon(x),$$
$$\sin x = x - \frac{x^3}{3!} + \cdots + (-1)^n \frac{x^{2n+1}}{(2n+1)!} + x^{2n+1} \varepsilon(x),$$

where

$$\lim_{x \to 0} \varepsilon(x) = 0.$$

These expansions can be written in the following forms:

$$\exp(x) = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + o(x^n),$$
$$\sin x = x - \frac{x^3}{3!} + \cdots + (-1)^n \frac{x^{2n+1}}{(2n+1)!} + o(x^{2n+1}),$$

where $x^n \varepsilon(x)$ and $x^{2n+1} \varepsilon(x)$ have been denoted by $o(x^n)$ and $o(x^{2n+1})$, respectively.


We have also that $o(x_n)$ means that the sequence $o(x_n)$ tends to zero quicker than
$x_n$, that is

$$\frac{o(x_n)}{x_n} \to 0 \quad \text{as } x_n \to 0.$$

We have also that $o_\omega(x_n)$ means the same but with probability 1 (when $x_n = x_n(\omega)$),
that is, for almost all $\omega$ (trajectories). So $o_\omega(1)$ means $o_\omega(1)/\text{const} \to 0$ as $n \to \infty$, and
$o(\gamma_n / p_n(\omega))$ means that it tends to zero quicker than $\gamma_n / p_n(\omega)$.


Definition 38 (Big-O)

$$O(x)$$

means that

$$\left| \frac{O(x)}{x} \right| < \infty,$$

i.e., this ratio is bounded when $x \to 0$.


Remark 23 (Equality for two convex functions) The argument $x^*$ for which
two convex functions $f_1(\cdot)$ and $f_2(\cdot)$ are equal is given by

$$\max_x \min\{f_1(x), f_2(x)\} = f_1(x^*) = f_2(x^*)$$
$$\min_x \max\{f_1(x), f_2(x)\} = f_1(x^*) = f_2(x^*) \qquad (1.44)$$
Figure 1.24 Equality between two convex functions


Indeed, consider the geometrical representation of two convex functions
(see Figure 1.24). From this figure, it is clear that $x^*$ is given by (1.44).


1.7. Conclusions 


This chapter was devoted to stochastic processes. We presented and described prob- 
ability space. The expectation as well as the conditional mathematical expectation 
were given and their properties were reported. Many examples such as random 
walk, Markov processes, Markov chains, renewal processes and martingales were 
presented. We frequently considered the sum of random variables, which plays 
an important role in many engineering areas. We recalled the main operations on 
states and showed their connection with probability calculus. Martingales arise 
naturally whenever one needs to consider mathematical expectations with respect 
to increasing information patterns. 


The detailed proofs of the theoretical results presented in this chapter can serve
as the first step in a training process for the analysis of recursive schemes. In this
sense, this chapter represents a kind of initial step to Chapter 4.
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Chapter 2 


Estimation of Probability 
Densities 


2.1. Introduction 


This chapter is dedicated to probability density estimation problems [60, 69, 75].
Probability densities are useful for different purposes. For example, the efficiency 
of liquid-liquid extraction columns depends on the probability distributions of the 
drops. In image processing the probability density function of the brightness is 
commonly used [17]. Probability distributions are of greatest interest in the areas 
of probabilities and statistics. Among other things, they permit the calculation of 
the mathematical expectation of a given random variable. Many techniques exist 
for their estimation. We will present techniques based on the moments, series 
of rectangular pulses, polynomial representation and kernels, focusing our atten- 
tion on three parametric approaches, namely, the expectation maximization (EM) 
[37, 61] and its extension based on the total kurtosis, the method based on neural
networks [47], and finally an approach based on stochastic approximation tech- 
niques [77]. The emphasis here will be on numerical considerations pertinent to 
probability densities estimation. In fact, all the aforesaid shows the importance of 
these problems. Several simulations are presented in order to show the performance 
of these algorithms and to illustrate their implementation. 


This chapter is organized as follows. The main probability distributions are given 
in the first part while the second part is dedicated to several estimation techniques 
which are illustrated by numerical simulations. 


2.2. Main Probability Distributions 


Probability distributions play an important role in many stochastic problems such as 
those related to systems reliability, image signal processing, etc. In this section we 
present the well-known probability density and distribution functions. They will be 
denoted by f(t) and F(t), respectively. An introduction to probability distributions 
and characteristic functions was given in Chapter 1 (Subsections 1.2.4 and 1.2.6). 
Probability distributions can be classified into two main categories: binary-valued 
discrete distributions and continuous-valued distributions. 
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2.2.1. Bernoulli Distribution 


The Bernoulli distribution corresponds to repeated independent trials where there 
are only two possible realizations for each trial, and their probabilities remain the 
same throughout the trials. It is usual to denote the two probabilities by p and q, 
and to refer to the realization (outcome) with probability p as “success,” and q as 
“failure.” Of course, p and q must be nonnegative, and their sum must be equal to 
one, $q = 1 - p$. The Bernoulli distribution is given by


Pr(X = 1)= p, Pr(X = 0) = q. 


2.2.2. Binomial Distribution 


Let us consider a set of $n$ successive and independent Bernoulli trials. The
probability of getting $x$ successes in $n$ trials is given by

$$\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}, \quad x = 0, 1, \dots, n \quad \text{and} \quad 0 < p < 1$$

and its characteristic function is

$$\phi(t) = [q + p \exp(it)]^n,$$

where

$$\binom{n}{k} = \begin{cases} \dfrac{n!}{k!\,(n-k)!} & \text{for } n \ge k \ge 0 \\ 0 & \text{for } k > n \text{ or } k < 0. \end{cases}$$

For $n = 1$, we obtain the Bernoulli law.


The binomial law is usually denoted by $b(x; n, p)$ or $B(n, p)$. The term “binomial”
comes from the expansion of Newton's binomial

$$(p + q)^n = \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x}.$$

The probability of obtaining $x$ successes from $n$ Bernoulli trials, when $n$ is large
and $p$ is small, can be calculated by the following limit [4]:

$$\lim_{\substack{n \to \infty \\ p \to 0 \\ np = \text{cte}}} \binom{n}{x} p^x (1 - p)^{n-x} = \frac{(np)^x \exp(-np)}{x!} = \frac{\lambda^x \exp(-\lambda)}{x!}. \qquad (2.1)$$
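The limit above can be observed numerically. The following sketch holds $np = \lambda$ fixed and reports the largest pointwise gap between the binomial and Poisson probabilities; the value $\lambda = 2$ and the range of $x$ are illustrative choices, not taken from the text.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Pr(X = x) for n Bernoulli trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Limiting probability lam^x exp(-lam) / x!."""
    return lam ** x * exp(-lam) / factorial(x)


def max_gap(n, lam, x_max=10):
    """Largest |binomial - Poisson| over x = 0..x_max with np = lam."""
    p = lam / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
               for x in range(x_max + 1))


for n in (10, 100, 1000, 10000):
    print(n, max_gap(n, 2.0))
```

The printed gap should shrink as $n$ grows, in line with the use of the Poisson law as an approximation for large $n$ and small $p$.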


The mean (mathematical expectation) and the variance of a given random variable 
are usefully thought of as measures of location (mean) and dispersion (variance) of 
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the distribution of the random variable considered. The mean of an absolutely con- 
tinuous random variable corresponds to the center of gravity of a bar with density 
equal to the probability density of the random variable considered [18]. In separ- 
ation processes (distillation columns, absorption columns, liquid-liquid extraction 
columns, dryers, etc.), the mean can be constant and equal to the desired value, e.g. 
a concentration of a given component, but at the same time the necessary energy 
can fluctuate around its nominal value (which corresponds to the desired concentration).
As a consequence, the energy consumption increases with the dispersion
of the concentration of the component considered even if it remains constant in the 
mean (which corresponds to the desired specifications fixed by the customers) [54]. 


The main statistical characteristics of the binomial distribution, namely the mean 
and the variance, are given in what follows: 


• mathematical expectation:
$$E[X] = np; \qquad (2.2)$$
• variance:
$$\mathrm{Var}[X] = np(1 - p).$$


2.2.3. Hypergeometric Distribution 
Let us consider an urn containing r red balls and b black balls. The total number 


of balls will be denoted by n = r + b. A set of m balls are randomly withdrawn 
from the urn. Let p, be the probability that the m balls contain exactly k red balls. 


Notice that the red balls can be chosen in $\binom{r}{k}$ different ways and the black balls in
$\binom{b}{m-k}$ ways. The probability $p_k$ is then given by

$$p_k = \frac{\binom{r}{k} \binom{b}{m-k}}{\binom{n}{m}}. \qquad (2.3)$$

Of course, this probability is defined only for $k \le m$.


A random variable $X$ is distributed according to the hypergeometric distribution
if for $k \le m$, the probability $\Pr(X = k)$ is given by (2.3) [22]. For $n > 2$,
Figure 2.1 Two Poisson distributions Poi(λ) with parameters λ = 1 and λ = 0


the mathematical expectation and the variance are given by:


• mathematical expectation:
$$E[X] = \frac{mr}{n}; \qquad (2.4)$$

• variance:
$$\mathrm{Var}[X] = \frac{mr(n - r)(n - m)}{n^2 (n - 1)}. \qquad (2.5)$$
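As a check on the urn model, the probabilities $p_k$ can be summed exactly with rational arithmetic. The sketch below is only illustrative: the urn sizes used in the example call are arbitrary, and the function names are not from the text.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, r, b, m):
    """Probability of exactly k red balls among m drawn from r red + b black."""
    return Fraction(comb(r, k) * comb(b, m - k), comb(r + b, m))


def hypergeom_moments(r, b, m):
    """Return (total probability, mean, variance) by direct summation."""
    ks = range(0, min(r, m) + 1)
    probs = [hypergeom_pmf(k, r, b, m) for k in ks]
    total = sum(probs)
    mean = sum(k * p for k, p in zip(ks, probs))
    second = sum(k * k * p for k, p in zip(ks, probs))
    return total, mean, second - mean ** 2


r, b, m = 5, 7, 4          # arbitrary example urn
n = r + b
total, mean, var = hypergeom_moments(r, b, m)
print(total, mean, Fraction(m * r, n))
print(var, Fraction(m * r * (n - r) * (n - m), n * n * (n - 1)))
```

Because `Fraction` is exact, the summed mean and variance can be compared with the closed forms (2.4) and (2.5) without rounding error.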


The hypergeometric and binomial laws are two of the most common probability 
models related to counting failures [5]. 


2.2.4. Poisson Distribution 


Poisson distribution (see Figure 2.1) is a discrete law that is commonly used to 
characterize independent random phenomena (number of events occurring in a 
given interval of time such as the number of files submitted for transmission during 
T seconds, the number of interruptions generated by a CPU during T seconds, etc.) 
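[85]. The number of flaws per unit length of a wire, given the average number of
flaws per unit length, follows a Poisson distribution:

$$f(x) = \frac{\lambda^x}{x!} \exp(-\lambda), \quad x = 0, 1, 2, \dots \quad \text{and} \quad \lambda > 0,$$

$$\phi(t) = \exp\{\lambda (\exp(it) - 1)\}.$$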


If $\{S_n; n \ge 1\}$ is a sequence of independent random variables distributed according
to the exponential law of parameter $\lambda = 1$, then, for all $a > 0$, the integer random
variable [6]

$$N = \inf\{n \ge 1 : S_1 + \cdots + S_n > a\}$$

follows the Poisson law with parameter $a$.
• Mathematical expectation:
$$E[X] = \lambda;$$
• variance:
$$\mathrm{Var}[X] = \lambda.$$
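Example 42 (Mean and variance of the Poisson distribution) Let us calculate
the characteristics of the Poisson law, i.e., the mathematical expectation (mean)
and the variance.


Mathematical expectation:

$$E[X] = \sum_{x=0}^{\infty} x f(x). \qquad (2.6)$$

For $x = 0$, we obtain 0, then (2.6) leads to

$$E[X] = \sum_{x=1}^{\infty} x \frac{\lambda^x}{x!} \exp(-\lambda) = \lambda \exp(-\lambda) \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} = \lambda \exp(-\lambda) \sum_{m=0}^{\infty} \frac{\lambda^m}{m!} = \lambda \exp(-\lambda) \exp(\lambda) = \lambda.$$

Hence, the parameter $\lambda$ represents the mean.
Variance:

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2.$$

Let us first calculate the second moment:

$$E[X^2] = \sum_{x=0}^{\infty} x^2 \frac{\lambda^x}{x!} \exp(-\lambda) = \sum_{x=1}^{\infty} [(x - 1) + 1] \frac{\lambda^x}{(x-1)!} \exp(-\lambda)$$
$$= \lambda \left[ \sum_{x=1}^{\infty} (x - 1) \frac{\lambda^{x-1}}{(x-1)!} \exp(-\lambda) + \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} \exp(-\lambda) \right]. \qquad (2.7)$$

Recall that

$$\sum_{x=0}^{\infty} x \frac{\lambda^x}{x!} \exp(-\lambda) = \lambda \quad \text{and} \quad \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} \exp(-\lambda) = 1.$$

It follows that

$$E[X^2] = \lambda(\lambda + 1).$$

The variance is then given by

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \lambda(\lambda + 1) - \lambda^2 = \lambda. \qquad (2.8)$$

Hence, the parameter $\lambda$ also represents the variance [78].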
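The two sums in this example can also be evaluated numerically. The sketch below truncates the series after a fixed number of terms (an assumption that is harmless for moderate $\lambda$) and builds the probabilities from the standard recursion $f(x) = f(x-1)\,\lambda / x$ to avoid large factorials.

```python
from math import exp


def poisson_mean_variance(lam, terms=200):
    """Approximate E[X] and Var[X] by truncating the defining series."""
    pmf = exp(-lam)            # f(0)
    mean = 0.0
    second = 0.0
    for x in range(1, terms):
        pmf *= lam / x         # f(x) = f(x - 1) * lam / x
        mean += x * pmf
        second += x * x * pmf
    return mean, second - mean ** 2


for lam in (0.5, 1.0, 6.0):
    print(lam, poisson_mean_variance(lam))
```

Both returned values should agree with $\lambda$ up to floating-point error, which is the equality between mean and variance derived above.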


This property (equality between the mean and the variance of a random variable 
distributed according to the Poisson law) is commonly used in practice in order to 
verify if a given random variable is distributed according to the Poisson law. Let 
us now derive some properties of the Poisson law. 


By definition, we have

$$\Pr(X = x) = p_x(\lambda) = \frac{\lambda^x \exp(-\lambda)}{x!}.$$

The fundamental convolution relationship for the Poisson distribution [31] is

$$p_x(\lambda) * p_x(\mu) = p_x(\lambda + \mu).$$

The successive probabilities $p_{x-1}(\lambda)$ and $p_x(\lambda)$ are connected as follows:

$$x\, p_x(\lambda) = \lambda\, p_{x-1}(\lambda).$$

The successive moments can be calculated recursively, e.g. by the recursive formula
given by Aroian [2, 31]:

$$m_{i+1} = \lambda m_i + \lambda \frac{d m_i}{d\lambda}.$$
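The recursion for the moments lends itself to a small symbolic computation. In the sketch below each moment $m_i$ is stored as a list of polynomial coefficients in $\lambda$; starting the recursion from $m_0 = 1$ is an assumption made here for illustration (it is the zeroth raw moment).

```python
def next_moment(coeffs):
    """Apply m_{i+1} = lam * m_i + lam * d(m_i)/d(lam).

    coeffs[j] is the coefficient of lam**j in m_i."""
    out = [0] * (len(coeffs) + 1)
    for j, c in enumerate(coeffs):
        out[j + 1] += c          # contribution of lam * m_i
        out[j] += j * c          # lam * d/dlam (c lam^j) = j c lam^j
    return out


def raw_moments(order):
    """Return the polynomials m_0, ..., m_order as coefficient lists."""
    moments = [[1]]              # assumed starting point m_0 = 1
    for _ in range(order):
        moments.append(next_moment(moments[-1]))
    return moments


for i, m in enumerate(raw_moments(4)):
    print(i, m)
```

The second entry is $\lambda$ and the third is $\lambda + \lambda^2$, matching the mean and the second moment $\lambda(\lambda + 1)$ obtained in Example 42.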
We shall end this description of the Poisson law by discussing Raikov's Theorem, 
which states that if the sum of m independent random variables follows the Poisson 
Law, then each of the terms is also a Poisson random variable [59]. 


The Poisson law plays an important role in reliability, quality control, agriculture 
(distribution in space or time of plants and animals, etc.), ecology, biology (the 
sampling of bacteria per square, the number of photons reaching the retina, etc.), 
medicine (number of focal lesions in virology, etc.), telephony, insurance (acci- 
dents, etc.), business (the number of transactions per day for a given stock, etc.), 
queuing theory (air traffic, etc.), sociology and demography, etc. Chapter 7 in [31] is 
dedicated to the applications of Poisson distribution and contains many references. 
In many applications and studies related to limiting forms of distributions like
binomial, Poisson, etc., the following Stirling approximation

$$n! \approx \sqrt{2\pi n}\; n^n \exp(-n)$$

is very useful.


From expression (2.1), we derive that the Poisson law constitutes an approximation 
of the binomial law. 


Remark 24 (On approximations) In view of (2.1), the Poisson distribution is a
convenient approximation of the binomial distribution (see for instance Theorem 1, 
Chapter 2, p. 32 in [11]) in the case of large n (number of trials) and small p 
(probability of success) [22]. 


For large $\lambda$ ($> 10$), the Poisson distribution can be correctly approximated by a
normal distribution with mean $m = \lambda$ and standard deviation $\sigma = \sqrt{\lambda}$.


2.2.5. Gaussian Distribution 


Random variables that are Gaussian (normal) are of central importance in practice 
(the measurement errors are very often assumed to have a normal distribution). 
Blood pressure or height of males and females are normally distributed but with 
different means. It has been argued that normal theory is consistent with the data in 
the framework of reliability [5]. In general, when the random effects act additively, 
it is natural to assume that the observations are distributed according to a normal 
law. It is given by

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - m)^2}{2\sigma^2} \right).$$

The expression of the characteristic function $\phi(t)$

$$\phi(t) = \exp\left( imt - \frac{\sigma^2 t^2}{2} \right)$$

is obtained by making use of the Cauchy contour integration theorem. An easy
computation shows that $m$ and $\sigma^2$ represent the mean and the variance, respectively.
Note that these two parameters characterize the Gaussian probability density
function. In what follows we shall present the details related to the calculation of
the characteristic function1.


1 The calculations can also be done by using the following well-known result (for $A > 0$):

$$\int_{-\infty}^{+\infty} \exp(-Ax^2 + 2Bx - C)\, dx = \sqrt{\frac{\pi}{A}} \exp\left( -\frac{AC - B^2}{A} \right).$$
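In view of the definition of the characteristic function, we get

$$\phi(t) = \int_{-\infty}^{+\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - m)^2}{2\sigma^2} \right) \exp(itx)\, dx$$
$$= \int_{-\infty}^{+\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{x^2 - 2mx - 2\sigma^2 itx + m^2}{2\sigma^2} \right) dx$$
$$= \int_{-\infty}^{+\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - (m + \sigma^2 it))^2}{2\sigma^2} \right) \exp\left( itm - \frac{t^2\sigma^2}{2} \right) dx$$
$$= \exp\left( itm - \frac{t^2\sigma^2}{2} \right).$$

If the random variable $X$ is Gaussian with mean $m$ and variance $\sigma^2$, we sometimes
write

$$X \sim N(m, \sigma^2).$$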


Figure 2.2 illustrates two Gaussian distributions. The Gaussian law is symmetrical 
(not skewed). 


Figure 2.2 Two Gaussian (normal, $N(m, \sigma^2)$) distributions: N(1, 1) and N(6, 0.3)
The Gaussian random variables exhibit several interesting properties. As an
example, let us consider a set of $N$ independent random Gaussian variables
$X_1, \dots, X_N$, where each variable $X_i$ is distributed according to
$N(m_i, \sigma_i^2)$. The sum $S = \sum_{i=1}^{N} X_i$ is $N(m_S, \sigma_S^2)$ where

$$m_S = \sum_{i=1}^{N} m_i \quad \text{and} \quad \sigma_S^2 = \sum_{i=1}^{N} \sigma_i^2.$$
This result corresponds to an application of the central limit theorem. 


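Theorem 4 (Central limit theorem) Let $X_1, \dots, X_n$ be independent and
identically distributed random variables with finite mean and variance, i.e.,

$$E\{X_i\} = m < \infty, \quad \mathrm{Var}(X_i) = \sigma^2 < \infty,$$

then

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

is asymptotically (for large $n$) normally distributed, i.e., $N\left(m, \frac{1}{n}\sigma^2\right)$.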
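A short simulation illustrates the statement. The sketch below assumes uniform(0, 1) summands, for which $m = 1/2$ and $\sigma^2 = 1/12$; the sample sizes and the seed are arbitrary illustrative choices.

```python
import random


def sample_means(n, trials, rng):
    """Draw `trials` independent averages of n uniform(0, 1) variables."""
    return [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]


def mean_and_variance(values):
    """Empirical mean and (population) variance of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


rng = random.Random(0)       # fixed seed so the run is reproducible
n = 30
mu, var = mean_and_variance(sample_means(n, 4000, rng))
print("mean of averages:", mu, "expected:", 0.5)
print("variance of averages:", var, "expected:", (1 / 12) / n)
```

The empirical mean and variance of the averages should be close to $m$ and $\sigma^2 / n$; a histogram of the averages (not shown) would be close to a bell shape.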


Observe that in general the mean and the variance of a random variable are
estimated using the following formulae:

$$E[X] = m = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad (2.9)$$

$$\mathrm{Var}[X] = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - m)^2, \qquad (2.10)$$

where $N$ represents the number of the available observations ($x_i$, $i = 1, \dots, N$) of
the random variable $X$.


Remark 25 (Sample variance) Observe that in the expression of Var[X] we
have $(N - 1)$ in the denominator. This is due to the fact that $m$ represents only an
estimation of the mean value of the random variable $X$. In fact, let us consider the
following optimization problem

$$\arg\min_m \mathrm{Var}[X] = \arg\min_m \frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2.$$

Differentiating, we obtain

$$\frac{d\,\mathrm{Var}[X]}{dm} = -\frac{2}{N} \sum_{i=1}^{N} (x_i - m).$$

The optimality condition leads to

$$m = \frac{1}{N} \sum_{i=1}^{N} x_i.$$

The variance has a minimum value if expression (2.9) is used, but this expression
represents only an estimation. As a consequence, with (2.10) the estimation of the
variance is certainly greater than the value given by

$$\frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2.$$

The number $(N - 1)$ represents what is called the number of degrees of freedom.
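The two estimators (2.9) and (2.10) can be written in a few lines. The sketch below is a plain-Python illustration with a made-up data set; it also computes the version with $N$ in the denominator so that the two values can be compared.

```python
def sample_mean(xs):
    """Estimate (2.9): arithmetic mean of the observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Estimate (2.10): squared deviations divided by N - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def variance_about_mean(xs):
    """Same sum of squared deviations, divided by N instead of N - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative observations
print(sample_mean(data))
print(sample_variance(data), variance_about_mean(data))
```

For any data set with at least two distinct values, the $(N - 1)$ version is the larger of the two, as the remark states.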


Finally, notice that in parameter estimation, the distribution of the estimations 
usually tends asymptotically to the normal distribution N (0, P), where P represents 
the covariance matrix of the vector parameters [36]. 


2.2.6. Truncated Normal Distribution 
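The truncated normal distribution (see Figure 2.3) is given by

$$f(x) = \begin{cases} \dfrac{1}{\delta \sigma \sqrt{2\pi}} \exp\left( -\dfrac{(x - m)^2}{2\sigma^2} \right) & \text{if } |x| \le x_+ \\ 0 & \text{if } |x| > x_+ \end{cases}$$

where $\delta$ is a normalizing constant.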


2.2.7. Lognormal Distribution 
Let us consider a positive random variable $X$. If the random variable

$$Y = \ln X$$

is distributed according to the normal law with mean $m_Y$ and variance $\sigma_Y^2$
($N(m_Y, \sigma_Y^2)$), then the random variable $X$ is said to be distributed according to
the lognormal law.
Figure 2.3 Two truncated normal distributions


The log function plays an important role in many engineering areas (chemical, 
mechanics, etc.), and in medicine, economics, etc. The rescaling of physical vari- 
ables permits one to highlight the relations (correlation) between physical variables 
or phenomena. In the control of distillation columns, the logarithmic composi- 
tions are used in order to effectively eliminate the effect of nonlinearity at high 
frequencies and also reduce its effect at steady-state operation. 


In practice, small disturbances often act as multiplicative rather than additive 
increases. In this situation it is natural to assume that the logarithm of the considered 
variable is normally distributed. The reparation durations are usually distributed 
according to the lognormal law. 


The probability density function is given by

$$f(x) = \frac{1}{x \delta \sqrt{2\pi}} \exp\left( -\frac{(\ln x - m)^2}{2\delta^2} \right).$$

The lognormal distribution has two parameters, $m$ and $\delta$. Their relation to the
location and dispersion of the distribution is given by the following:


• mathematical expectation:
$$E[X] = \exp\left( m + \frac{\delta^2}{2} \right);$$

• variance:
$$\mathrm{Var}[X] = e^{2m + \delta^2} \left( e^{\delta^2} - 1 \right).$$
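The relation between the parameters $(m, \delta)$ and the mean and variance can be checked against the values quoted in the caption of Figure 2.4. The inverse mapping in the sketch below is not given in the text; it is obtained here by solving the two formulas above for $m$ and $\delta$.

```python
from math import exp, log, sqrt


def lognormal_moments(m, delta):
    """Mean and variance of the lognormal law with parameters m and delta."""
    mean = exp(m + delta ** 2 / 2)
    var = exp(2 * m + delta ** 2) * (exp(delta ** 2) - 1)
    return mean, var


def lognormal_params(mean, var):
    """Invert the two moment formulas (derived here, not from the text)."""
    delta2 = log(1 + var / mean ** 2)
    return log(mean) - delta2 / 2, sqrt(delta2)


print(lognormal_params(1.0, 1.0))    # compare with logn(-0.3466, 0.8326)
print(lognormal_params(6.0, 0.3))    # compare with logn(1.7876, 0.0911)
print(lognormal_moments(-0.3466, 0.8326))
```

The first two calls should reproduce, to four decimals, the parameter pairs reported for the two curves of Figure 2.4.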
Figure 2.4 Two lognormal distributions logn($m$, $\delta$). The parameters logn(−0.3466,
0.8326) and logn(1.7876, 0.0911) result in means of 1 and 6, and variances of 1 and 0.3,
respectively


The lognormal distribution is skewed to the right (see Figure 2.4). This distri- 
bution law has also been used to model, e.g., the cross-section or the amplitude 
of synthetic aperture radar (SAR) images [76]. The lognormal distribution is 
commonly used in reliability analysis, cycles-to-failure in fatigue, the amp- 
litude statistics of clutter (unwanted reflective waves from irrelevant targets) 
[67], etc. Economists often model the distribution of incomes using a lognor- 
mal distribution. In biology it has been observed that the growth of organisms 
sometimes proceeds by multiplicative rather than additive increments. Notice 
also that the particle size distribution of aerosols is usually approximated by a 
sum of lognormal distributions. 


2.2.8. Laplace Distribution 


The Laplace probability density function (see Figure 2.5) and its characteristic
function are given by

$$f(x) = \frac{1}{2\beta} \exp\left( -\frac{|x - a|}{\beta} \right),$$

$$\phi(t) = \frac{\exp(iat)}{1 + \beta^2 t^2}.$$

A random variable distributed according to the Laplace law has moments of any
order [24].
Figure 2.5 Two Laplace distributions Lapl($a$, $\beta$): Lapl(1, 0.7071) and Lapl(6, 0.3873)


• mathematical expectation:
$$E[X] = a;$$
• variance:
$$\mathrm{Var}[X] = 2\beta^2.$$


2.2.9. Cauchy Distribution 


The Cauchy probability density function and the characteristic function $\phi(t)$ are
given by

$$f(x) = \frac{1}{\pi} \frac{\lambda}{\lambda^2 + (x - \mu)^2}, \quad \lambda > 0,$$

$$\phi(t) = \exp(i\mu t - \lambda |t|).$$

The characteristic function $\phi(t)$ is not differentiable at $t = 0$. As a consequence,
none of the moments of the Cauchy distribution exist [24]. A random variable $Y$ with
the density $\exp(i\mu t - \lambda|t|)$ is a particular case of a random variable with Laplace
probability density function ($X = \lambda Y + \mu$ has a Laplace distribution) [24].


• Mathematical expectation does not exist. The corresponding integral diverges2.
• Variance does not exist (the variance is infinite). The corresponding integral
diverges.


2 Chasles's relation must be used with care. Indeed,

$$\int_{-\infty}^{+\infty} \frac{x}{1 + x^2}\, dx = \int_{-\infty}^{0} \frac{x}{1 + x^2}\, dx + \int_{0}^{+\infty} \frac{x}{1 + x^2}\, dx.$$
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Remark 26 The binomial, Gaussian, Poisson and Cauchy probability distribu- 
tions are commonly used in astronomy [57, 70]. 


2.2.10. Exponential Distribution 


The exponential probability density function (see Figure 2.6) is usually used for 
modelling the waiting time between two independent phenomena (telephone calls, 
etc.), e.g. modelling the successive interruptions generated by a CPU. In [14] it 
has been shown that the locomotion activity in cell cultures obeys the exponential
distribution:

$$f(x) = \lambda \exp(-\lambda x), \quad x \ge 0 \quad \text{and} \quad \lambda > 0,$$

$$\phi(t) = \frac{\lambda}{\lambda - it}.$$

Notice that the exponential distribution possesses the memoryless (lack-of-memory)
property. A random variable $x$ is said to be without memory, or
memoryless, if

$$\Pr\{x > s + t \mid x > t\} = \Pr\{x > s\} \quad \text{for } s, t \ge 0. \qquad (2.11)$$

It turns out that not only is the exponential distribution “memoryless,” but it is the
unique continuous distribution possessing this property [11]. Indeed,

$$\Pr\{x > s + t \mid x > t\} = \frac{\Pr\{x > s + t,\, x > t\}}{\Pr\{x > t\}} = \frac{\Pr\{x > s + t\}}{\Pr\{x > t\}} = \frac{\exp(-\lambda(s + t))}{\exp(-\lambda t)} = \exp(-\lambda s).$$
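The memoryless relation (2.11) can be verified directly from the survival function $\Pr\{x > u\} = \exp(-\lambda u)$. The sketch below is a numerical illustration; the values of $s$, $t$ and $\lambda$ are arbitrary.

```python
from math import exp


def survival(u, lam):
    """Pr{x > u} for the exponential law with parameter lam."""
    return exp(-lam * u)


def conditional_survival(s, t, lam):
    """Pr{x > s + t | x > t} computed as a ratio of survival functions."""
    return survival(s + t, lam) / survival(t, lam)


lam = 0.5
for s, t in [(1.0, 0.0), (1.0, 2.0), (1.0, 10.0)]:
    print(s, t, conditional_survival(s, t, lam), survival(s, lam))
```

The conditional probability does not depend on the elapsed time $t$: each row should show the same value as the unconditional survival at $s$.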


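The first integral on the right-hand side of this equality can be written as follows:

$$\int_{-\infty}^{0} \frac{x}{1 + x^2}\, dx = \int_{+\infty}^{0} \frac{y}{1 + y^2}\, dy = -\int_{0}^{+\infty} \frac{y}{1 + y^2}\, dy, \quad \text{for } y = -x.$$

This result leads to

$$\int_{-\infty}^{+\infty} \frac{x}{1 + x^2}\, dx = 0,$$

which is incorrect because we know that this integral is divergent. Indeed, the decomposition
of this integral into two integrals

$$\int_{-\infty}^{+\infty} (\cdot)\, dx = \int_{-\infty}^{0} (\cdot)\, dx + \int_{0}^{+\infty} (\cdot)\, dx$$

assumes that the integral of the absolute value $\int_{\mathbb{R}} |\cdot|\, dx$ is not divergent.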
Figure 2.6 Two exponential distributions exp(λ). Exp(1) and exp(0.1667) have means 1
and 6, respectively


The mean and the variance are given by:


• mathematical expectation:
$$E[X] = \frac{1}{\lambda}; \qquad (2.12)$$

• variance:
$$\mathrm{Var}[X] = \frac{1}{\lambda^2}.$$

Notice that the hazard function associated with the exponential distribution is
constant and equal to $\lambda$.


Example 43 Consider a parallel structure of N components. This system is func- 
tioning if at least one component is functioning. Let X; be the lifetime of the 
ith component (i = 1,...,N). Suppose that the lifetimes are independent and 
identically distributed according to the exponential law. Determine the probability 
distribution function of the lifetime of the system. 


Solution Let us denote by $S$ the lifetime of the system. It is evident that

$$S = \max(X_1, \dots, X_N).$$

(Notice that for a series structure of $N$ components, the lifetime of the system is
given by $S = \min(X_1, \dots, X_N)$.)
From the assumptions, it follows that

$$\Pr(S \le s) = \Pr(X_1 \le s, \dots, X_N \le s) = \Pr(X_1 \le s) \cdots \Pr(X_N \le s) = \left[ \int_0^s \lambda \exp(-\lambda x)\, dx \right]^N$$
$$= (1 - \exp(-\lambda s))^N = \int_0^s f_S(x)\, dx.$$

The probability density function of the lifetime of the system is

$$f_S(x) = N \lambda \exp(-\lambda x) (1 - \exp(-\lambda x))^{N-1}.$$
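The density obtained in this solution can be checked by integrating it numerically and comparing with the distribution function. The sketch below uses a simple midpoint rule; the values of $\lambda$, $N$ and the integration settings are illustrative assumptions.

```python
from math import exp


def system_cdf(s, lam, n):
    """Pr(S <= s) for a parallel system of n exponential components."""
    return (1 - exp(-lam * s)) ** n


def system_pdf(x, lam, n):
    """Density of the system lifetime, as derived in the solution."""
    return n * lam * exp(-lam * x) * (1 - exp(-lam * x)) ** (n - 1)


def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


lam, n = 1.5, 3
approx = integrate(lambda x: system_pdf(x, lam, n), 0.0, 2.0)
print(approx, system_cdf(2.0, lam, n))
```

The two printed numbers should agree to several decimals, which confirms that $f_S$ is the derivative of $(1 - \exp(-\lambda s))^N$.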


2.2.11. Geometric Distribution 


Recall the Bernoulli distribution, the binary-valued discrete distribution discussed 
earlier in this subsection. Geometric distribution corresponds to Bernoulli trials 
where x now represents the trial that leads to the first success. It is used to model 
the number of successive events occurring until another event happens (numbers 
of successful file transfers until the first error occurs, etc.). 


$$f(x) = (1 - p)^{x-1} p, \quad x = 1, 2, \dots, \quad 0 < p < 1,$$

$$\phi(t) = \frac{p \exp(it)}{1 - q \exp(it)}.$$

The mathematical expectation (mean) and the variance are given by the following:


• Mathematical expectation: For $k = 0$, we obtain 0. Then (2.6) leads to
$$E[X] = \frac{1}{p}; \qquad (2.14)$$

• Variance:
$$\mathrm{Var}[X] = \frac{1 - p}{p^2}.$$

Notice that a geometric random variable with $0 < p < 1$ can be obtained by
the discretization of an exponential random variable with mean $m = 1/\lambda$ with
$p = 1 - \exp(-\lambda)$. Conversely, a geometric distribution with mean $m = 1/p$
Figure 2.7 Two Rayleigh distributions Rayl($\beta$). Rayl(0.7979) and Rayl(4.7873) have
mathematical expectations equal to 1 and 6, respectively


is a discrete version of the exponential distribution with mean $m = 1/\lambda$, $p = 1 - \exp(-\lambda)$.
Finally, observe that discrete random variables distributed according
to the geometric distribution are memoryless random variables.
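The link between the two laws can be made concrete. The sketch below assumes that "discretization" means rounding an exponential variable up to the next integer (an interpretation, not stated in the text) and compares the resulting probabilities with the geometric law for $p = 1 - \exp(-\lambda)$.

```python
from math import exp


def discretized_exponential_pmf(k, lam):
    """Pr{k - 1 < X <= k} for X exponential with parameter lam."""
    return exp(-lam * (k - 1)) - exp(-lam * k)


def geometric_pmf(k, p):
    """Pr{first success on trial k}, k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p


lam = 0.7
p = 1 - exp(-lam)
for k in range(1, 6):
    print(k, discretized_exponential_pmf(k, lam), geometric_pmf(k, p))
```

Under that assumption the two columns coincide, since $\exp(-\lambda(k-1)) - \exp(-\lambda k) = (1 - p)^{k-1} p$.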


Remark 27 An extension of the lack-of-memory property, namely the almost- 
lack-of-memory property, has been introduced in [12, 13]. A nonnegative random 
variable (or its distribution) is said to have the almost lack of memory property 
if points $nt$, $n = 1, 2, \dots$, exist where it has the lack-of-memory property (see
expression (2.11)). This property has been shown to be deeply related to periodicity, 
and consequently is useful for modelling events occurring on a seasonal basis [16]. 


2.2.12. Rayleigh Distribution 


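The Rayleigh distribution (see Figure 2.7) is given by:

$$f(x) = \frac{x}{\beta^2} \exp\left( -\frac{x^2}{2\beta^2} \right), \quad x \ge 0, \quad \beta > 0.$$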


This distribution law has been used to model the cross-section or the amplitude of 
SAR images [35]. Notice also that wind turbine performances are usually given on 
the basis of the Rayleigh distribution. 
The mean and the variance are given by:


• Mathematical expectation:
$$E[X] = \beta \sqrt{\frac{\pi}{2}};$$

• Variance:
$$\mathrm{Var}[X] = \beta^2 \left( 2 - \frac{\pi}{2} \right).$$

Remark 28 If $X_1$ and $X_2$ are independent $N(0, \sigma^2)$, then the random variable
$X = \sqrt{X_1^2 + X_2^2}$ is distributed according to the Rayleigh law.
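Remark 28 suggests a simple way to simulate Rayleigh samples. The sketch below assumes that the Rayleigh parameter $\beta$ equals the standard deviation $\sigma$ of the two Gaussian components, and uses the parameter values quoted in the caption of Figure 2.7; the seed and sample size are arbitrary.

```python
import random
from math import hypot, pi, sqrt


def rayleigh_sample(sigma, rng):
    """Square root of the sum of squares of two independent N(0, sigma^2)."""
    return hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def rayleigh_mean(beta):
    """Closed-form mean beta * sqrt(pi / 2)."""
    return beta * sqrt(pi / 2)


def rayleigh_variance(beta):
    """Closed-form variance beta^2 * (2 - pi / 2)."""
    return beta ** 2 * (2 - pi / 2)


rng = random.Random(0)       # fixed seed so the run is reproducible
samples = [rayleigh_sample(0.7979, rng) for _ in range(20000)]
print(sum(samples) / len(samples), rayleigh_mean(0.7979))
print(rayleigh_mean(4.7873))
```

The closed-form means for $\beta = 0.7979$ and $\beta = 4.7873$ come out close to 1 and 6, and the simulated average should be close to the first of them.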


2.2.13. Gamma Distribution 
The modelling and design of crystallizers is mainly based on the crystal-size distri- 


bution. The gamma probability density function is commonly used for representing 
the crystal-size distribution [73]: 


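$$f(x) = \frac{1}{\alpha^\beta\, \Gamma(\beta)}\, x^{\beta - 1} \exp\left( -\frac{x}{\alpha} \right),$$

$$\phi(t) = \left( \frac{1}{1 - i\alpha t} \right)^{\beta},$$

where $\Gamma(\beta)$ is the gamma function (the integral $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt$). The
gamma law tends to the normal one as $\alpha$ decreases (see Figure 2.8). The gamma
law is skewed.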


The mathematical expectation (mean) and the variance are given by:
• mathematical expectation:
$$E[X] = \alpha\beta; \qquad (2.15)$$
• variance:
$$\mathrm{Var}[X] = \alpha^2 \beta.$$


If Y and Z are independently distributed according the gamma law with shape 
parameters m and n, respectively, then their sum is also distributed according to a 
gamma law with $\beta = m + n$. Conversely, a random variable $S$ distributed according
to the gamma law with parameter (n + m) can always be decomposed into a sum of 
two independently random variables U and V, distributed according to the gamma 
law, with parameters m and n respectively [45]. 
Figure 2.8 Gamma distributions Gamma($\alpha$, $\beta$). The upper plot shows two Gamma pdf's
Gamma(1, 1) and Gamma(0.05, 120). The lower plot illustrates the effect of changing $\alpha$,
from Gamma(0.5, 8) to Gamma(1, 8), Gamma(1.5, 8) and Gamma(2, 8)


For integer-valued $\beta$, the gamma distribution is also known as the Erlang
distribution. In the case of $\beta = 1$, it reduces to the exponential one.


2.2.14. Weibull Distribution


The Weibull distribution (see Figure 2.9) is given by: 


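$$f(x) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta-1}\exp\left[-\left(\frac{x}{\alpha}\right)^{\beta}\right], \qquad \alpha, \beta > 0,\ x > 0.$$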


This probability distribution depends on two parameters: $\beta$, a "shape" parameter,
and $\alpha$, a "scale" parameter. As indicated by its name, this distribution was developed
by Weibull, in the framework of the statistical theory of the strength of materials.
The common use of this distribution is mainly due to the fact that the observed
random variables correspond to the minimum of a large set of random variables
acting independently [28].


Figure 2.9 Two Weibull distributions Wei($\alpha$, $\beta$): Wei(1, 1) and Wei(6.3, 10.3) having
approximate means 1 and 6, respectively


The mathematical expectation (mean) and the variance are given by:

• mathematical expectation:

$$E[X] = \alpha\,\Gamma\left(1 + \frac{1}{\beta}\right) = \frac{\alpha}{\beta}\,\Gamma\left(\frac{1}{\beta}\right);$$

• variance:

$$\mathrm{Var}[X] = \frac{\alpha^2}{\beta}\left\{2\,\Gamma\left(\frac{2}{\beta}\right) - \frac{1}{\beta}\left[\Gamma\left(\frac{1}{\beta}\right)\right]^2\right\}. \tag{2.16}$$

Usually, manufacturers provide the user with the Weibull parameters for a given
device (product) and the user can calculate the probability that a part fails after one,
two, three or more months (years). The hazard function is equal to

$$\rho(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta-1}.$$

For $\beta = 1$, the hazard function is equal to a constant. If $\beta > 1$, then the hazard
rate increases with survival time. If $\beta < 1$, then the hazard rate decreases with
time. These properties explain the common use of the Weibull distribution in the
modelling of optimization problems related to maintenance and replacement.
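
The behaviour of the Weibull hazard function for the three regimes of the shape parameter can be tabulated with a few lines of Python. This is a minimal sketch: the function names and the numerical values ($\alpha = 10$, the chosen times) are illustrative assumptions. The failure probability $1 - \exp[-(t/\alpha)^{\beta}]$ used below is not stated in the text; it follows from integrating the Weibull density.

```python
import math


def weibull_pdf(t, alpha, beta):
    """Weibull density with scale alpha and shape beta."""
    return (beta / alpha) * (t / alpha) ** (beta - 1) * math.exp(-((t / alpha) ** beta))


def weibull_failure_probability(t, alpha, beta):
    """Pr(X <= t) = 1 - exp(-(t/alpha)^beta), obtained by integrating the pdf."""
    return 1.0 - math.exp(-((t / alpha) ** beta))


def weibull_hazard(t, alpha, beta):
    """Hazard function (beta/alpha) * (t/alpha)^(beta - 1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


for beta in (0.5, 1.0, 2.0):
    rates = [round(weibull_hazard(t, 10.0, beta), 4) for t in (1.0, 5.0, 20.0)]
    print(f"beta={beta}: hazard at t = 1, 5, 20 ->", rates)
```

The hazard equals the density divided by the survival probability $1 - \Pr(X \le t)$, which gives a quick consistency check on the three functions.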
Remark 29 For $\beta = 1$, the Weibull law leads to the exponential one. Notice
also that: i) as $\beta$ increases ($\beta > 3$), the Weibull law tends to the normal law;
ii) the Rayleigh distribution is a special case of the Weibull distribution. Indeed,
for $\beta = 2$, the Weibull distribution is given by

$$f(x) = \frac{2}{\alpha}\left(\frac{x}{\alpha}\right)\exp\left[-\left(\frac{x}{\alpha}\right)^{2}\right],$$

which corresponds to the Rayleigh distribution with Rayleigh parameter $\alpha/\sqrt{2}$.


This distribution law has also been used to model the cross-section or the amplitude 
of SAR images. During the past twenty years there has been a considerable growth 
of interest in various Weibull-distributed ground, sea, sea-ice and weather clutter 
returns concerning false alarms and effective detection processes [67]. For example, 
the Weibull distribution is used to describe the wind speed pattern. 


The $\chi^2$ distribution will be presented in Subsection 2.8.1.


This concludes our presentation of the main distribution functions, used for
characterizing the probability for a random variable to take a particular value.
A remark is in order here.


Remark 30 Every nonnegative function f(x) that is integrable in the Riemann
sense and such that

$$\int_{-\infty}^{+\infty} f(x)\,dx = 1$$

defines a distribution function.


Indeed, let us consider an RC circuit without resistive path to ground (see
Figure 2.10). The time delay introduced by RC circuit networks in microelectronic
technology has induced many studies and the development of RC delay metrics
[49, 50, 65].


The Laplace transform Y(s) of the output y(t) as a function of the Laplace transform
U(s) of the input u(t) is given by

$$Y(s) = \frac{1}{1 + RCs}\,U(s).$$


Figure 2.10 RC circuit (input u(t), capacitor C, output y(t))


The impulse response ($U(s) = 1$) of this circuit is equal to

$$h(t) = \frac{1}{RC}\exp\left(-\frac{t}{RC}\right)$$

and satisfies the following conditions

$$h(t) \ge 0 \quad \forall t \qquad\text{and}\qquad \int_0^{\infty} h(t)\,dt = 1.$$

In view of what has been said before, this impulse response represents a probability
density function [65].


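The Laplace transform of the step response of a first-order system is given by

$$Y_s(s) = \frac{1}{s}\,\frac{1}{1 + RCs}, \qquad U(s) = \frac{1}{s}.$$

The impulse and step responses of a first-order system are depicted in Figure 2.11.
The step response can be considered as the corresponding probability distribution
function. Indeed,

$$Y_s(s) = \frac{1}{s}\,Y_\delta(s),$$

where $Y_\delta(s)$ denotes the Laplace transform of the impulse response, i.e., the step
response is the integral of the impulse response.


The gamma [49] as well as the Weibull distribution [50] have been used for the delay³
metrics for RC circuits.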


3 In automatic control, the Laplace transform $\exp(-s\tau)$ of the time delay $\tau$ is usually
approximated by a quotient of two polynomials such as

$$\frac{1 - (\tau/2)s}{1 + (\tau/2)s}$$

or as a series of n first-order systems, i.e.,

$$\frac{1}{\left(1 + \dfrac{\tau}{n}s\right)^{n}}.$$


Figure 2.11 Impulse and step responses of a first-order system


Example 44 From the coefficients of the Taylor expansion at s = 0 of the Laplace
transform of the impulse response of an RC system (first-order system), determine 
the mean, the variance and the third centered moment. 


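Solution By definition, the Laplace transform of the impulse response h(t) is
given by

$$H(s) = \int_0^{\infty} h(t)\exp(-st)\,dt.$$

Based on the Taylor expansion of exp(−st) at s = 0, we obtain

$$H(s) = \int_0^{\infty} h(t)\left[1 - st + \frac{s^2t^2}{2!} - \cdots\right]dt = \sum_{k=0}^{\infty}\frac{(-1)^k}{k!}\,s^k\int_0^{\infty} t^k h(t)\,dt = a_0 + a_1 s + a_2 s^2 + \cdots,$$

where

$$a_j = \frac{(-1)^j}{j!}\int_0^{\infty} t^j h(t)\,dt, \qquad j = 0, 1, 2, \ldots. \tag{2.17}$$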




Let m be the mean, i.e.,

$$m = \int_0^{\infty} t\,h(t)\,dt.$$

From (2.17), we derive

$$m = -a_1.$$


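The variance and the third centered moment are given by

$$\sigma^2 = \mu_2 = \int_0^{\infty} (t - m)^2 h(t)\,dt, \qquad \mu_3 = \int_0^{\infty} (t - m)^3 h(t)\,dt.$$

Based on (2.17), we obtain

$$\sigma^2 = \mu_2 = \int_0^{\infty} (t^2 - 2mt + m^2)\,h(t)\,dt = 2a_2 - a_1^2, \tag{2.18}$$

$$\mu_3 = \int_0^{\infty} (t^3 - 3mt^2 + 3m^2t - m^3)\,h(t)\,dt = -6a_3 + 6a_1a_2 - 2a_1^3.$$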
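
For the RC circuit, $H(s) = 1/(1 + RCs) = \sum_j (-RC)^j s^j$, so the Taylor coefficients are $a_j = (-RC)^j$. The following Python sketch applies the moment formulas of Example 44 to these coefficients; the function names and the value RC = 2 are illustrative assumptions.

```python
def rc_taylor_coefficients(rc, order=3):
    """Taylor coefficients a_0..a_order of H(s) = 1 / (1 + RC s) at s = 0."""
    return [(-rc) ** j for j in range(order + 1)]


def moments_from_taylor(a):
    """Mean, variance and third centered moment from the coefficients a_1, a_2, a_3."""
    a1, a2, a3 = a[1], a[2], a[3]
    mean = -a1
    variance = 2 * a2 - a1**2
    mu3 = -6 * a3 + 6 * a1 * a2 - 2 * a1**3
    return mean, variance, mu3


mean, variance, mu3 = moments_from_taylor(rc_taylor_coefficients(2.0))
print(mean, variance, mu3, mu3 / variance**1.5)
```

For a general RC the formulas give mean RC, variance $(RC)^2$ and third centered moment $2(RC)^3$, which are the moments of an exponential density with time constant RC.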


The next remark deals with the statistical interpretation of the impulse response of
an RC circuit. In fact, there exist similarities between this impulse response and
the gamma and Weibull distributions.


Remark 31 By equating the centered moments related to the impulse response and
those associated with the gamma and Weibull distributions, Lin et al. [49] and Liu et al.
[50] have identified the parameters $\alpha$ and $\beta$ characterizing these distributions.


Remark 32 Many theorems exist that give the necessary conditions for a function
to be a characteristic function: Bochner-Khinchin, Polya, etc. (see [68]).


Remark 33 Let $y_1, y_2, \ldots, y_n$ be independent and uniformly distributed in [0, 1]
and let

$$X := -2\sum_{i=1}^{n} w_i \log y_i,$$

where $w_1, w_2, \ldots, w_n$ are positive weights. If these weights are unequal, then the
exact probability density function of X is given by Good⁴ [30]. If all $w_i$'s are equal,
then one arrives simply at the Erlang distribution. If some of the weights $w_i$ are
equal, it has been shown that the pdf of X can be approximated by a sum of gamma
distributed density functions, a single gamma distribution or by the ratio of two
gamma distributions [66].


The next section is dedicated to skewness and kurtosis measures (coefficients) [18]. 
These measures characterize the shape of a given probability distribution. 


2.3. Skewness and Kurtosis Measures 


The skewness and kurtosis parameters are both measures of the shape of the
distribution. Skewness (coefficient of asymmetry) gives information about the tendency
of the deviations from the mean to be larger in one direction than in the other.
The skewness is mainly an intuitive description of a given distribution. The third
moment characterizes the asymmetry of a distribution.


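Let us consider the transformed random variable $(X - E\{X\})/\sigma$. It has zero mean
and unit variance. The skewness coefficient is given by

$$\gamma_1 = \frac{\mu_3}{\sigma^3} = E\left\{\left(\frac{X - E\{X\}}{\sigma}\right)^{3}\right\} = \frac{E\{(X - E\{X\})^3\}}{\sigma^3}.$$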


A positive value for skewness indicates that the data are skewed to the right; a neg- 
ative value indicates that the data are skewed to the left (see Figure 2.12). In other 
words the skewness coefficient measures the departure from symmetry. The normal 
distribution has skewness equal to zero. 


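The kurtosis of a probability distribution of a random variable x is defined as the
ratio of the fourth moment $\mu_4$ to the square of the variance $\sigma^4$, i.e.,

$$\frac{\mu_4}{\sigma^4} = E\left\{\left(\frac{x - E\{x\}}{\sigma}\right)^{4}\right\} = \frac{E\{(x - E\{x\})^4\}}{\sigma^4}.$$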


Kurtosis is primarily a measure of the heaviness of the tails of a distribution. The
normal distribution has a kurtosis equal to 3. This statistic is often standardized so that
a normal distribution⁵ has a kurtosis of 0. The normalized excess is given by

$$\frac{\mu_4}{\sigma^4} - 3.$$

For a Gaussian distribution, the normalized excess is zero. For a Laplace
distribution, the kurtosis $\mu_4/\mu_2^2$ is equal to 6. Positive values of the normalized excess
indicate that the distribution has heavier tails than the normal distribution, and negative
values indicate lighter tails (see Figure 2.13). The kurtosis represents a generalization
of the mean and variance measures used for location and dispersion for a Gaussian
distribution [18].


4 $f_X(x) = \dfrac{1}{2}\sum_{j=1}^{n} \dfrac{w_j^{\,n-2}}{\prod_{i \ne j}(w_j - w_i)} \exp\left(-\dfrac{x}{2w_j}\right)$.


Figure 2.12 Skewness measure


The following example shows how to calculate the skewness coefficient.


Example 45 (Skewness) Calculate the skewness of the following distribution:

$$f(x) = \begin{cases} \dfrac{3}{8}x^2 & \text{for } 0 < x < 2 \\ 0 & \text{otherwise.} \end{cases}$$

(The distribution is illustrated in Figure 2.14a.)


5 It is well known by economists that the kurtosis of daily stock returns is larger than the
kurtosis of a normal distribution.


Figure 2.13 Kurtosis measure


Figure 2.14 Two distributions used in the examples (panels a and b). The dashed line shows
the normal distributions N(3/2, 3/20) in plot a), and N(0, 2) in plot b)
Solution The mean m, the variance, and the 3rd moment about the mean are
given by

$$m = \int_0^2 x\,\frac{3}{8}x^2\,dx = \left[\frac{3x^4}{32}\right]_0^2 = \frac{3}{2},$$

$$\sigma^2 = \mu_2 = \int_0^2 (x - m)^2\,\frac{3}{8}x^2\,dx = \frac{3}{20},$$

$$\mu_3 = \int_0^2 (x - m)^3\,\frac{3}{8}x^2\,dx = -\frac{1}{20}.$$

The skewness coefficient is then equal to

$$\gamma_1 = \frac{-1/20}{(3/20)^{3/2}} \approx -0.86,$$

i.e., the distribution is skewed to the left.
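
The integrals of Example 45 can be cross-checked by numerical quadrature. The Python sketch below uses a plain midpoint rule; the helper names `central_moment` and `skewness` and the number of subintervals are illustrative assumptions.

```python
def central_moment(pdf, lo, hi, order, mean, n=20_000):
    """Midpoint-rule approximation of the integral of (x - mean)^order * pdf(x) on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += (x - mean) ** order * pdf(x)
    return total * h


def skewness(pdf, lo, hi, n=20_000):
    """Return (mean, variance, third centered moment, skewness coefficient)."""
    mean = central_moment(pdf, lo, hi, 1, 0.0, n)  # first raw moment
    var = central_moment(pdf, lo, hi, 2, mean, n)
    mu3 = central_moment(pdf, lo, hi, 3, mean, n)
    return mean, var, mu3, mu3 / var**1.5


def pdf_45(x):
    """Density of Example 45 on (0, 2)."""
    return 3.0 * x * x / 8.0


print(skewness(pdf_45, 0.0, 2.0))
```

The same two helpers can be reused for any density with bounded support by changing `pdf` and the integration limits.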
Example 46 (Kurtosis) Calculate the kurtosis of the following distribution:

$$f(x) = \begin{cases} \dfrac{1}{2}\exp(-x) & \text{for } x \ge 0 \\ \dfrac{1}{2}\exp(x) & \text{for } x < 0. \end{cases}$$

The distribution is illustrated in Figure 2.14b.

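Solution The symmetry of the distribution function f(x) leads to

$$m = 0.$$

The variance and the third and fourth moments about the mean are given by

$$\sigma^2 = \mu_2 = \int_{-\infty}^{0} (x - m)^2\,\tfrac{1}{2}\exp(x)\,dx + \int_0^{\infty} (x - m)^2\,\tfrac{1}{2}\exp(-x)\,dx = 1 + 1 = 2,$$

$$\mu_3 = \int_{-\infty}^{0} (x - m)^3\,\tfrac{1}{2}\exp(x)\,dx + \int_0^{\infty} (x - m)^3\,\tfrac{1}{2}\exp(-x)\,dx = -3 + 3 = 0,$$

$$\mu_4 = \int_{-\infty}^{0} (x - m)^4\,\tfrac{1}{2}\exp(x)\,dx + \int_0^{\infty} (x - m)^4\,\tfrac{1}{2}\exp(-x)\,dx = 12 + 12 = 24.$$

The skewness coefficient is then equal to

$$\frac{\mu_3}{\sigma^3} = 0$$

(the distribution is symmetric) and the kurtosis coefficient gives

$$\frac{\mu_4}{\sigma^4} = \frac{24}{4} = 6.$$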


Example 47 (Skewness) Calculate the skewness of the impulse response of the 
RC circuit. 


Solution From (2.18), we derive

$$\gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{-6a_3 + 6a_1a_2 - 2a_1^3}{\left(2a_2 - a_1^2\right)^{3/2}}.$$


In the next section we define the probability distribution of a transformation of 
a random variable. 


2.4. Classification of Probability Distributions 
Let us recall the Gaussian distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x - m)^2}{2\sigma^2}\right).$$

Its derivative is given by

$$f'(x) = \frac{df(x)}{dx} = -\frac{1}{\sigma\sqrt{2\pi}}\left(\frac{x - m}{\sigma^2}\right)\exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$

and leads to the following differential equation:

$$\frac{df(x)}{dx} = -\left(\frac{x - m}{\sigma^2}\right)f(x),$$

which is a particular form of the following:

$$x^n\,(b_0 + b_1x + b_2x^2)\,\frac{df(x)}{dx}\,dx = x^n\,(x - a)\,f(x)\,dx \tag{2.19}$$

for

$$n = 0, \qquad b_0 + b_1x + b_2x^2 = -\sigma^2 \qquad\text{and}\qquad a = m.$$


Equation (2.19) is satisfied by many probability distributions [71]. Integrating by
parts the left-hand side of (2.19), we derive

$$\left[x^n(b_0 + b_1x + b_2x^2)\,f(x)\right]_{-\infty}^{+\infty} - \int_{-\infty}^{+\infty}\left(nb_0x^{n-1} + (n+1)b_1x^{n} + (n+2)b_2x^{n+1}\right)f(x)\,dx$$
$$= \int_{-\infty}^{+\infty} x^{n+1}f(x)\,dx - a\int_{-\infty}^{+\infty} x^{n}f(x)\,dx. \tag{2.20}$$

Assuming that the boundary term in square brackets vanishes at the extremities of the
distribution, i.e.,

$$\lim_{x \to \pm\infty} x^{n+2}f(x) = 0,$$

then, substituting moments for the integrals in (2.20), we obtain a recursive formula

$$-nb_0m_{n-1} - (n+1)b_1m_n - (n+2)b_2m_{n+1} = m_{n+1} - am_n,$$

or, in an equivalent form,

$$nb_0m_{n-1} + [(n+1)b_1 - a]\,m_n + [(n+2)b_2 + 1]\,m_{n+1} = 0. \tag{2.21}$$
These recurrence relations permit the calculation of any moment from those of
lower orders. Based on the differential equation (2.19), we can derive the explicit
expression of the probability density function f(x) by integration. Notice that the
integration of this differential equation depends on the roots of the polynomial

$$b_0 + b_1x + b_2x^2,$$

which depend on the discriminant

$$\Delta = b_1^2 - 4b_0b_2$$

or on the parameter

$$\kappa = \frac{b_1^2}{4b_0b_2}. \tag{2.22}$$


The parameter $\kappa$ has been introduced by Pearson as a criterion for the classification
of a large class of probability density functions. By considering the mean m as the
origin, the coefficients $b_0$, $b_1$ and $b_2$ can be expressed as a function of the central
moments $m_2$, $m_3$ and $m_4$:

$$b_1 = a = -\frac{m_3(m_4 + 3m_2^2)}{10m_4m_2 - 12m_3^2 - 18m_2^3},$$

$$b_0 = -\frac{m_2(4m_2m_4 - 3m_3^2)}{10m_4m_2 - 12m_3^2 - 18m_2^3},$$

$$b_2 = -\frac{2m_2m_4 - 3m_3^2 - 6m_2^3}{10m_4m_2 - 12m_3^2 - 18m_2^3}.$$

Consequently, the Pearson criterion can be written in the following form:

$$\kappa = \frac{\beta_1(\beta_2 + 3)^2}{4(2\beta_2 - 3\beta_1 - 6)(4\beta_2 - 3\beta_1)},$$

where

$$\beta_1 = \frac{m_3^2}{m_2^3}, \qquad \beta_2 = \frac{m_4}{m_2^2}.$$
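
The coefficient formulas and the two expressions of the Pearson criterion can be exercised numerically. The following Python sketch is written under the sign convention of (2.19) and (2.21), with the mean taken as origin; the function names are illustrative assumptions. It uses the central moments of the density of Example 45 ($m_2 = 3/20$, $m_3 = -1/20$, $m_4 = 39/560$, the last value obtained by direct integration).

```python
def pearson_coefficients(m2, m3, m4):
    """Return (b0, b1, b2) of (2.19) from the central moments m2, m3, m4."""
    d = 10.0 * m4 * m2 - 12.0 * m3**2 - 18.0 * m2**3
    b1 = -m3 * (m4 + 3.0 * m2**2) / d
    b0 = -m2 * (4.0 * m2 * m4 - 3.0 * m3**2) / d
    b2 = -(2.0 * m2 * m4 - 3.0 * m3**2 - 6.0 * m2**3) / d
    return b0, b1, b2


def pearson_kappa(m2, m3, m4):
    """Pearson criterion written with beta1 = m3^2 / m2^3 and beta2 = m4 / m2^2."""
    beta1 = m3**2 / m2**3
    beta2 = m4 / m2**2
    return beta1 * (beta2 + 3.0) ** 2 / (
        4.0 * (2.0 * beta2 - 3.0 * beta1 - 6.0) * (4.0 * beta2 - 3.0 * beta1)
    )


m2, m3, m4 = 0.15, -0.05, 39.0 / 560.0
b0, b1, b2 = pearson_coefficients(m2, m3, m4)
print("b0, b1, b2 =", b0, b1, b2)
print("kappa from (2.22):", b1**2 / (4.0 * b0 * b2))
print("kappa from beta1, beta2:", pearson_kappa(m2, m3, m4))
```

For a Gaussian distribution the denominator of the criterion vanishes (both $b_1$ and $b_2$ are zero), so `pearson_kappa` should not be called in that case.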


2.5. Transformation of Random Variables 


Let x be a random variable with a probability density function $f_x(x)$, and consider
the following transformation

$$y = \varphi(x), \qquad x = \varphi^{-1}(y).$$

We assume that:


1. the first derivative of $\varphi(\cdot)$ (the first derivatives in the multidimensional case)
exists and is piecewise continuous;

2. $\varphi(\cdot)$ is not constant in any set of the values of the argument x having a
probability different from zero.
In view of these assumptions, we get

$$\Pr(y \in B) = \int_B f_x(\varphi^{-1}(y))\,\left|[\varphi^{-1}(y)]'\right|\,dy,$$

where $[\varphi^{-1}(y)]' = d\varphi^{-1}(y)/dy$ and B is any set. Hence the probability density of
the random variable y is given by

$$f_y(y) = f_x(\varphi^{-1}(y))\,\left|[\varphi^{-1}(y)]'\right|. \tag{2.23}$$


In what follows, we shall prove this result for monotonic functions $\varphi(\cdot)$. Figure 2.15
shows the two possible evolutions (increasing or decreasing) of the function $\varphi(x)$.
Let us first consider the case where $\varphi(x)$ is an increasing function in the given
interval [a, b].


From the left side of this figure (positive derivative), we derive

$$\Pr(Y < y) = \Pr(a < X < x) = \int_a^{x} f_x(u)\,du = \int_a^{\varphi^{-1}(y)} f_x(x)\,dx.$$

The calculation of the derivative of this integral with respect to y gives

$$f_y(y) = f_x(\varphi^{-1}(y))\,[\varphi^{-1}(y)]', \qquad [\varphi^{-1}(y)]' > 0. \tag{2.24}$$


Figure 2.15 Monotonic function


Now let us consider the second case (see the right side of Figure 2.15, negative
derivative), which corresponds to a decreasing function $\varphi(\cdot)$. We obtain

$$\Pr(Y < y) = \Pr(x < X < b) = \int_x^{b} f_x(u)\,du = \int_{\varphi^{-1}(y)}^{b} f_x(x)\,dx$$

and consequently

$$f_y(y) = -f_x(\varphi^{-1}(y))\,[\varphi^{-1}(y)]', \qquad [\varphi^{-1}(y)]' < 0. \tag{2.25}$$

The combination of (2.24) and (2.25) leads to the desired result.


Remark 34 When the function $\varphi(\cdot)$ is not monotonic, the term Pr(Y < y) can be
calculated by considering the intervals where the function $\varphi(x)$ is located under
the segment AB (see the shaded region of Figure 2.16).


Remark 35 (Linear transformations) If the transformation $\varphi(x)$ is linear, i.e.,

$$\varphi(x) = ax + b, \qquad a \ne 0,$$

then

$$\varphi^{-1}(y) = \frac{1}{a}(y - b), \qquad \frac{d\varphi^{-1}(y)}{dy} = \frac{1}{a},$$

and

$$f_y(y) = \frac{1}{|a|}\,f_x\left(\frac{y - b}{a}\right).$$


Figure 2.16 Nonmonotonic function


This last expression shows that if x is normally distributed, then y = ax + b is also
normally distributed.


Example 48 Let x and y be two random variables with a joint probability density
f(x, y). Calculate the probability density of the random variable z = x + y.


Solution For a given x, the equation

$$z = x + y$$

has a unique solution in y (and likewise in x for a given y). Applying formula (2.23),
we obtain

$$f_z(z) = \int_{-\infty}^{+\infty} f(x, z - x)\,dx = \int_{-\infty}^{+\infty} f(z - y, y)\,dy. \tag{2.26}$$

If the random variables x and y are independent⁶, then

$$f(x, y) = f_x(x)\,f_y(y).$$

As a consequence, we derive

$$f_z(z) = \int_{-\infty}^{+\infty} f_x(x)\,f_y(z - x)\,dx = \int_{-\infty}^{+\infty} f_x(z - y)\,f_y(y)\,dy. \tag{2.27}$$

These integrals represent the convolutions⁷ of the functions $f_x(x)$ and $f_y(y)$.
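
The convolution integral for the sum of two independent random variables can be evaluated numerically. The sketch below is a minimal Python illustration, assuming two uniform densities on [0, 1] (an example chosen here, not taken from the text) and a midpoint rule; the function names are hypothetical. The sum of two such variables has the triangular density on [0, 2], with peak value 1 at z = 1.

```python
def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def convolve_at(f_x, f_y, z, lo, hi, n=2_000):
    """Midpoint-rule approximation of the integral of f_x(x) * f_y(z - x) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += f_x(x) * f_y(z - x)
    return total * h


for z in (0.5, 1.0, 1.5):
    print(z, convolve_at(uniform_pdf, uniform_pdf, z, 0.0, 1.0))
```

The integration interval [lo, hi] only needs to cover the support of the first density, since the integrand vanishes elsewhere.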


Example 49 Let the random variable X be distributed according to the Gaussian
law

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x - m)^2}{2\sigma^2}\right).$$

Determine the distribution of the random variable defined by

$$Y = \exp(X).$$


6 Let $x_1, \ldots, x_n$ be random variables, and let $f_i$ be the probability distribution function of
$x_i$, $i = 1, \ldots, n$. Then the r.v. $x_i$ are independent if and only if

$$f(X_1, \ldots, X_n) = \Pr(x_1 < X_1, \ldots, x_n < X_n) = \prod_{i=1}^{n}\Pr(x_i < X_i) = \prod_{i=1}^{n} f_i(X_i),$$

where $f(X_1, \ldots, X_n)$ represents the multivariate distribution of the r.v. $x_1, \ldots, x_n$.

7 Doob says: "The study of many of the properties of the partial sums of the series $\sum_{1}^{\infty} x_n$
can be carried out as a study of iterated convolutions, with no reference to probability."
Solution From (2.23), we derive

$$f_Y(y) = f_X(\log y)\,\left|\frac{1}{y}\right| = \frac{1}{y\,\sigma\sqrt{2\pi}}\exp\left(-\frac{(\log y - m)^2}{2\sigma^2}\right), \qquad 0 < y < \infty,$$

which corresponds to the lognormal distribution.
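
Formula (2.23) translates directly into code for a monotonic transformation. The Python sketch below is an illustration under stated assumptions: the helper names (`transformed_pdf`, `probability`) are hypothetical, and the parameters m = 0, σ = 1 are chosen only for the check. It builds the lognormal density from the Gaussian one and verifies that Pr(Y ≤ 1) = Pr(X ≤ 0) = 1/2.

```python
import math


def normal_pdf(x, m, sigma):
    return math.exp(-((x - m) ** 2) / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))


def transformed_pdf(f_x, phi_inv, dphi_inv, y):
    """Density of Y = phi(X) for a monotonic phi, following (2.23)."""
    return f_x(phi_inv(y)) * abs(dphi_inv(y))


def lognormal_pdf(y, m, sigma):
    """Density of Y = exp(X): the inverse map is log(y), with derivative 1/y."""
    return transformed_pdf(lambda x: normal_pdf(x, m, sigma), math.log, lambda t: 1.0 / t, y)


def probability(pdf, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of pdf over [lo, hi]."""
    h = (hi - lo) / n
    return sum(pdf(lo + (i + 0.5) * h) for i in range(n)) * h


print(probability(lambda y: lognormal_pdf(y, 0.0, 1.0), 0.0, 1.0))
```

The midpoint rule never evaluates the density at y = 0, so the logarithm is always defined.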


Remark 36 A complex random variable Z is defined as follows:

$$Z = X + iY,$$

where X and Y are real random variables. A complex random variable can be
represented in the plane by a point, the coordinates of which are X and Y. The
mean and the variance respectively are defined by

$$m_z = m_x + i\,m_y, \qquad \mathrm{Var}(Z) = E\{|Z - m_z|^2\},$$

where

$$m_x = E\{X\}, \qquad m_y = E\{Y\}.$$

The variance of a complex random variable can be expressed as a function of the
variances of its real and imaginary parts. Indeed,

$$\mathrm{Var}(Z) = E\{|Z - m_z|^2\} = E\{|X + iY - m_x - i\,m_y|^2\} = E\{|(X - m_x) + i(Y - m_y)|^2\}$$
$$= E\{(X - m_x)^2 + (Y - m_y)^2\} = E\{(X - m_x)^2\} + E\{(Y - m_y)^2\} = \mathrm{Var}(X) + \mathrm{Var}(Y).$$

Finally, the covariance associated with two complex random variables $Z_1$ and $Z_2$
is given by

$$K_{z_1z_2} = E\{(Z_1 - m_{z_1})(Z_2 - m_{z_2})^*\} = K_{x_1x_2} + K_{y_1y_2} + i\,(K_{y_1x_2} - K_{x_1y_2}),$$

where $(z)^*$ represents the conjugate of z, and $K_{x_1x_2}$, $K_{y_1y_2}$, $K_{y_1x_2}$, $K_{x_1y_2}$ correspond
to the covariances of the couples of random variables $(X_1, X_2)$, $(Y_1, Y_2)$, $(Y_1, X_2)$,
$(X_1, Y_2)$.
In what follows, we shall deal with a nonlinear transformation of independent
random variables.


Example 50 Let $X_1$, $X_2$ and $X_3$ be independent random variables, and let $X_1$
and $X_2$ be normally distributed with

$$E[X_i] = 0 \quad\text{and}\quad E[X_i^2] = 1, \qquad i = 1, 2.$$

Calculate the probability distribution of

$$Y = \frac{X_1 + X_2X_3}{\sqrt{1 + (X_3)^2}}.$$


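Solution In what follows, we shall denote the mathematical expectation of a given
multivariable function $g(X_1, X_2, \ldots, X_N)$ with respect to the random variable $X_i$ by

$$E_{X_i}\{g(X_1, X_2, \ldots, X_N)\}.$$

We derive

$$\begin{aligned}
f_Y(x) &= \frac{d}{dx}\Pr(Y < x) = \frac{d}{dx}\Pr\left(\frac{X_1 + X_2X_3}{\sqrt{1 + (X_3)^2}} < x\right) \\
&= \frac{d}{dx}\Pr\left(X_1 < x\sqrt{1 + (X_3)^2} - X_2X_3\right) \\
&= \frac{d}{dx}E_{X_2,X_3}\left[\Pr\left(X_1 < x\sqrt{1 + (X_3)^2} - X_2X_3 \mid X_2, X_3\right)\right] \\
&= E_{X_2,X_3}\left[\frac{d}{dx}\Pr\left(X_1 < x\sqrt{1 + (X_3)^2} - X_2X_3 \mid X_2, X_3\right)\right] \\
&= E_{X_2,X_3}\left[\frac{d}{dx}\int_{-\infty}^{x\sqrt{1 + (X_3)^2} - X_2X_3}\underbrace{\frac{1}{\sqrt{2\pi}}\,e^{-s^2/2}}_{\text{distribution of } X_1}\,ds\right] \\
&= \frac{1}{\sqrt{2\pi}}E_{X_2,X_3}\left[\sqrt{1 + (X_3)^2}\,\exp\left\{-\frac{\left(x\sqrt{1 + (X_3)^2} - X_2X_3\right)^2}{2}\right\}\right].
\end{aligned}$$

Here we have used the fact that for a convergent integral the derivative and the
expectation can be interchanged,

$$\frac{d}{dx}E\left[\int_{-\infty}^{x\sqrt{1 + (X_3)^2} - X_2X_3}(\cdot)\,ds\right] = E\left[\frac{d}{dx}\int_{-\infty}^{x\sqrt{1 + (X_3)^2} - X_2X_3}(\cdot)\,ds\right],$$

and the following formula:

$$\frac{d}{dx}\int^{\varphi(x)} g(s)\,ds = g(\varphi(x))\,\frac{d\varphi(x)}{dx}.$$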

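So we have

$$\begin{aligned}
f_Y(x) &= \frac{1}{\sqrt{2\pi}}E_{X_2,X_3}\left[\sqrt{1 + (X_3)^2}\,\exp\left\{-\tfrac{1}{2}\left(x^2(1 + X_3^2) + X_2^2X_3^2 - 2xX_2X_3\sqrt{1 + (X_3)^2}\right)\right\}\right] \\
&= \frac{\exp(-x^2/2)}{\sqrt{2\pi}}E_{X_3}\left[\sqrt{1 + (X_3)^2}\;E_{X_2}\left\{\exp\left[-\tfrac{1}{2}\left(x^2X_3^2 + X_2^2X_3^2 - 2xX_2X_3\sqrt{1 + (X_3)^2}\right)\right]\right\}\right] \\
&= \frac{\exp(-x^2/2)}{\sqrt{2\pi}}E_{X_3}\left[\sqrt{1 + (X_3)^2}\int_{X_2=-\infty}^{\infty}\exp\left[-\tfrac{1}{2}\left(x^2X_3^2 + X_2^2X_3^2 - 2xX_2X_3\sqrt{1 + (X_3)^2}\right)\right]\frac{\exp(-X_2^2/2)}{\sqrt{2\pi}}\,dX_2\right] \\
&= \frac{\exp(-x^2/2)}{\sqrt{2\pi}}E_{X_3}\left[\frac{\sqrt{1 + (X_3)^2}}{\sqrt{2\pi}}\int_{X_2=-\infty}^{\infty}\exp\left[-\tfrac{1}{2}\left(x^2X_3^2 + X_2^2(1 + X_3^2) - 2xX_2X_3\sqrt{1 + (X_3)^2}\right)\right]dX_2\right].
\end{aligned}$$

Let us introduce the following variable:

$$z = \sqrt{1 + (X_3)^2}\,X_2, \qquad dz\,\big|_{X_3\ \text{fixed}} = \sqrt{1 + (X_3)^2}\,dX_2.$$

So

$$f_Y(x) = \frac{\exp(-x^2/2)}{\sqrt{2\pi}}E_{X_3}\left[\frac{1}{\sqrt{2\pi}}\int_{z=-\infty}^{\infty}\exp\left(-\tfrac{1}{2}\left(z^2 - 2xzX_3 + x^2X_3^2\right)\right)dz\right].$$

Observe that

$$z^2 - 2xzX_3 + x^2X_3^2 = (z - xX_3)^2;$$

then it follows that

$$f_Y(x) = \frac{\exp(-x^2/2)}{\sqrt{2\pi}}E_{X_3}\Bigg[\underbrace{\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left(-(z - xX_3)^2/2\right)dz}_{=\,1\ \text{(Gaussian distribution)}}\Bigg] = \frac{\exp(-x^2/2)}{\sqrt{2\pi}}.$$

So, Y is a standard normal random variable, with

$$E[Y] = 0 \quad\text{and}\quad E[Y^2] = 1.$$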
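
The result of Example 50 does not depend on the distribution of $X_3$, which can be illustrated by simulation. In the Python sketch below, the exponential law chosen for $X_3$, the seed and the sample size are arbitrary assumptions made only for the illustration; the function name `sample_y` is hypothetical.

```python
import math
import random
import statistics


def sample_y(n, seed=0):
    """Draw n values of Y = (X1 + X2*X3) / sqrt(1 + X3^2), with X1, X2 standard normal."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        x3 = rng.expovariate(0.5)  # arbitrary non-Gaussian choice for X3 (assumption)
        ys.append((x1 + x2 * x3) / math.sqrt(1.0 + x3 * x3))
    return ys


ys = sample_y(100_000)
print("mean          ", statistics.fmean(ys))
print("second moment ", statistics.fmean(y * y for y in ys))
print("Pr(|Y| < 1.96)", sum(abs(y) < 1.96 for y in ys) / len(ys))
```

For a standard normal variable the three printed quantities should be close to 0, 1 and 0.95, up to sampling error.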


The next section sets up some techniques for probability density modelling and 
estimation. 


2.6. Estimation of Probability Density Functions 


In this section, estimation of probability density functions is considered. The
estimation of pdfs has been under study for a long time, and various methods
exist. In general, most methods aim at modelling the unknown pdf using basis
function expansions, $p(x) = \sum_j \alpha_j g_j(x)$, where $g_j$ are the basis functions and $\alpha_j$
are constants. We will start by looking at polynomial expansions. Provided that the
moments of the distribution are available, the coefficients can be determined by the
method of moments. With orthogonal polynomials, the expansion coefficients are
particularly simple to calculate. Next, the kernel estimators are presented, which
provide a smoothing approach to pdf estimation. Given the smoothing parameter,
the Rosenblatt-Parzen method can be used to provide an estimator of the pdf based
on observed data.


A finite mixture density has the form $\Pr(X) = \sum_{i=1}^{M} \pi_i \Pr_i(X)$, where X is the
random variable, $\Pr_i(X)$ are M mixture component densities and $\pi_i$ are mixing
proportions. The mixture density estimation problem (see, e.g., [61]) for a given M
involves fitting the component densities and mixing proportions to an observed data
sequence. The individual component densities are each assumed to be described by a
finite set of parameters. Maximum likelihood is the main tool for solving mixture
density parameter estimation problems. The famous expectation maximization
(EM) algorithm is presented in Subsection 2.6.6, along with a variant due to Vlassis
and Likas [79] for automatically selecting the number of components M. The
maximum likelihood approach can be applied with more general model structures,
too. In the final subsection, sigmoid neural networks are considered in particular.
Numerical examples are given in Section 2.7. Section 2.8 considers model
validation and the chi-squared distribution. Finally, determination of an optimal
distribution using stochastic approximation techniques is discussed in Section 2.9,
followed by conclusions.


2.6.1. Method of Moments 


Usually, from physical considerations we can get information about the probabil- 
ity density of a given phenomenon. From the available information, the moments 
associated with the random variables considered can be computed, and the para- 
meters of the theoretical probability density function are calculated by matching 
the moments, i.e., equating the moments associated with the theoretical probability 
density function with the moments computed from the available data. 


For example, if we assume that the random variable considered is distributed
according to a normal law $N(m, \sigma^2)$, the parameters to be estimated are the mean
m and the variance $\sigma^2$. From the available data $x_1, \ldots, x_N$, the mean and the variance
of the random variable considered can be estimated as follows:

$$\hat{m} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{m})^2.$$


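In the case where a random variable is a priori assumed to be uniformly
distributed, i.e.,

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a < x < b \\ 0 & \text{for } x < a,\ x > b, \end{cases}$$

by matching the mean and the variance, we obtain

$$m = \frac{a + b}{2}, \qquad \sigma^2 = \frac{(b - a)^2}{12},$$

and consequently

$$a = m - \sqrt{3}\,\sigma, \qquad b = m + \sqrt{3}\,\sigma.$$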
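
The moment-matching formulas for the uniform law can be applied to data in a few lines. The following Python sketch is an illustration only: the function name `fit_uniform_by_moments`, the synthetic sample drawn on [2, 5] and the seed are assumptions.

```python
import math
import random
import statistics


def fit_uniform_by_moments(data):
    """Estimate (a, b) of a uniform law by matching the sample mean and variance."""
    m = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # square root of the 1/N variance estimate
    return m - math.sqrt(3.0) * sigma, m + math.sqrt(3.0) * sigma


rng = random.Random(1)  # fixed seed only for reproducibility
data = [rng.uniform(2.0, 5.0) for _ in range(50_000)]
print(fit_uniform_by_moments(data))
```

Note that the moment estimates of a and b need not bracket every observation, unlike the sample minimum and maximum.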


Let us recall that in the framework of delay metrics for RC circuits, the gamma
and the Weibull distributions have been used by Lin et al. [49] and Liu et al. [50].
They calculated the parameters associated with these distributions by considering
the RC circuit impulse response as a probability density function and by equating
the central moments given by these probability density functions to the centered
moments calculated from the RC circuit impulse response.


Figure 2.17 Approximation of a probability density function by a series
of rectangular pulses (amplitudes $a_i$ over the intervals $(x_{i-1}, x_i)$)


In summary, if the adopted model, i.e., the probability density function, involves n 
parameters, it is necessary to calculate n moments from the available data. Observe 
that when the order of the moment increases, the accuracy decreases quickly. 


The next subsection deals with the use of series of rectangular pulses for modelling 
of probability density functions. 


2.6.2. Series of Rectangular Pulses 


The probability density function can be approximated by a series of rectangular
pulses as shown in Figure 2.17:

$$f(x) = \sum_{i=1}^{N} a_i\,u(x_{i-1}, x_i), \qquad u(x_{i-1}, x_i) = \begin{cases} 1 & \text{for } x_{i-1} < x < x_i \\ 0 & \text{otherwise.} \end{cases}$$

Notice that a rectangular pulse can be decomposed into a set of unit steps, i.e.,

$$u(x_0, x_1) = e(x_0) - e(x_1), \tag{2.28}$$

where $e(x_0)$ represents the unit step occurring at $x_0$, i.e.,

$$e(x_0) = \begin{cases} 0 & \text{for } x < x_0 \\ 1 & \text{for } x > x_0. \end{cases}$$

In other words, equation (2.28) can be interpreted as the sum of two steps, one step
of magnitude equal to 1 occurring at $x = x_0 = 0$ combined with a second step
of magnitude equal to −1 occurring at $x = x_1$. $u(x_{i-2}, x_{i-1})$ corresponds to the
rectangular pulse $u(x_{i-1}, x_i)$ shifted back by the delay $(x_{i-1} - x_{i-2})$.


2.6.3. Modeling Using Polynomials 


The approximation of functions using polynomials is commonly used in many
engineering areas (process identification, sensors and actuators modelling,
development of correlations, filtering of signals, etc.). In this section, we will present
the approximation of probability density functions using polynomials, i.e.,

$$f(x) = \sum_{i=1}^{N} c_i\,\varphi_i(x). \tag{2.29}$$

The probability density function is considered to be a linear combination of a priori
known polynomials $\varphi_i(x)$ $(i = 1, \ldots, N)$. The polynomials can be chosen to be
orthogonal (Hermite, Chebyshev, Legendre polynomials, Edgeworth series, etc.)
[72] or not.


Let us assume that the moments of the random variable x are available. The N
unknown parameters $c_i$ $(i = 1, \ldots, N)$ can be calculated by equating the values of
the available (N − 1) moments $\mu_j$ with those calculated from the model (2.29),

$$\sum_{i=1}^{N} c_i\int_X x^j\,\varphi_i(x)\,dx = \mu_j \qquad (j = 1, \ldots, N - 1).$$

Thus, we obtain (N − 1) equations. We need another equation in order to calculate
the parameters $c_i$ $(i = 1, \ldots, N)$. The Nth equation is given by the normalization
condition

$$\int_X f(x)\,dx = \sum_{i=1}^{N} c_i\int_X \varphi_i(x)\,dx = 1.$$

Consequently, the determination of the parameters $c_i$ $(i = 1, \ldots, N)$ leads to
the solution of a set of N linear algebraic equations. The problem when the probability
density function f(x) is not a priori known will be considered in the following
sections.
Another approach is the expansion [20] of a given probability density on the basis
of polynomials that are orthogonal with respect to a reference distribution, i.e.,

$$f(x) = f_0(x)\sum_{i=0}^{\infty} c_i\,\varphi_i(x), \tag{2.30}$$

where $\varphi_i(x)$ are polynomials orthogonal with respect to the reference distribution
$f_0(x)$, i.e.,

$$\int_{-\infty}^{+\infty} f_0(x)\,\varphi_i(x)\,\varphi_j(x)\,dx = \delta_{ij},$$

with $\delta_{ij}$ the Kronecker delta. The orthogonality makes the calculation easy. In fact,
as for the Galerkin method, which is used for the approximation of the solution of
partial differential equations (see for instance [55]), we multiply (2.30) successively
by $\varphi_i(x)$ $(i = 0, 1, 2, \ldots)$ and integrate the result obtained from $-\infty$ to $+\infty$. We obtain

$$c_i = \int_{-\infty}^{+\infty} \varphi_i(x)\,f(x)\,dx, \qquad i = 0, 1, 2, \ldots$$

Kuznetsov et al. [44] have developed methods for using Hermite series expansions
in order to characterize multidimensional distributions of random processes
for purposes including nonlinear filtering. They refer to the parameters of these
expansions as quasi-moments.


2.6.4. Kernel Estimators 


Kernel estimators were first proposed by Nadaraya [56] and Watson [82]. The
methods related to the estimation of densities are closely related to this estimator.
Nadaraya and Watson propose an interpolation procedure. Let $X_1, \ldots, X_N$
be independent and identically distributed random variables with density f(x),
$x \in R$. The Rosenblatt [64] and Parzen [58] estimator of the density f(x) is a
suitably smoothed histogram. It is given by

$$\hat{f}_N(X_1, \ldots, X_N, x) = \frac{1}{N h_N}\sum_{i=1}^{N} K\left(\frac{x - X_i}{h_N}\right),$$

where $K(\cdot): R \to R^{+}$ is the so-called kernel function, which represents a fixed
bounded density function (uniform distribution on the interval $[-\tfrac{1}{2}, \tfrac{1}{2}]$, or standard
normal distribution), and $\{h_N\}$ is a sequence of positive numbers, $h_N \to 0$ as $N \to \infty$,
representing the bandwidth (window width).
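
The Rosenblatt-Parzen estimator is short to implement. The Python sketch below is a minimal illustration with a standard normal kernel; the function names, the bandwidth h = 0.5 and the sample values are made-up assumptions, not data from the text.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_estimate(x, samples, h, kernel=gaussian_kernel):
    """Rosenblatt-Parzen estimate of the density at x with bandwidth h."""
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)


samples = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]  # made-up observations, for illustration only
for x in (-1.0, 0.0, 1.0):
    print(x, parzen_estimate(x, samples, h=0.5))
```

A smaller bandwidth gives a spikier estimate and a larger one a smoother estimate; in practice h is chosen as a function of the sample size.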
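There exist many variants of the Rosenblatt-Parzen method. In what follows, we
present a recursive variant, which involves more weights. Let

$$\hat{f}_1(X_1, x) = \frac{1}{h_1}K\left(\frac{x - X_1}{h_1}\right),$$

$$\hat{f}_2(X_1, X_2, x) = (1 - a_2)\,\hat{f}_1(X_1, x) + a_2\,\frac{1}{h_2}K\left(\frac{x - X_2}{h_2}\right), \tag{2.31}$$

$$\hat{f}_n(X_1, \ldots, X_n, x) = (1 - a_n)\,\hat{f}_{n-1}(X_1, \ldots, X_{n-1}, x) + a_n\,\frac{1}{h_n}K\left(\frac{x - X_n}{h_n}\right).$$

This algorithm can also be written in a nonrecursive form:

$$\hat{f}_n(X_1, \ldots, X_n, x) = \frac{1}{\beta_n}\sum_{i=1}^{n}\frac{\gamma_i}{h_i}K\left(\frac{x - X_i}{h_i}\right),$$

where

$$\gamma_1 = 1, \qquad \gamma_i = a_i\prod_{j=2}^{i}(1 - a_j)^{-1} \quad (i = 2, \ldots, n), \qquad \beta_n = \prod_{j=2}^{n}(1 - a_j)^{-1}.$$

This recursive variant has been considered by Wolverton and Wagner [86] and
Yamato [87] with

$$a_n = \frac{1}{n} \quad\text{and}\quad \beta_n = n.$$

It has been shown (see p. 29, Theorem 4.3 in [51]) that

$$\hat{f}_n(X_1, \ldots, X_n, x) \xrightarrow{\ \text{a.s.}\ } f(x).$$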
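
With $a_n = 1/n$, the recursion (2.31) reduces to an average of kernels with individual bandwidths, which can be verified directly. The Python sketch below is an illustration under stated assumptions: the Gaussian kernel, the bandwidth sequence $h_i = i^{-1/5}$ and the sample values are choices made here, not prescriptions from the text.

```python
import math


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def recursive_estimate(x, samples, bandwidths):
    """Run recursion (2.31) with a_n = 1/n; bandwidths[i] plays the role of h_{i+1}."""
    f = 0.0
    for n, (xn, hn) in enumerate(zip(samples, bandwidths), start=1):
        a_n = 1.0 / n  # a_1 = 1, so the first step returns K((x - X_1)/h_1)/h_1
        f = (1.0 - a_n) * f + a_n * gaussian_kernel((x - xn) / hn) / hn
    return f


def batch_estimate(x, samples, bandwidths):
    """Nonrecursive form for a_n = 1/n: (1/n) * sum of K((x - X_i)/h_i)/h_i."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / hi) / hi for xi, hi in zip(samples, bandwidths)) / n


samples = [0.3, -1.1, 0.8, 0.0, 1.7, -0.5]               # made-up observations
bandwidths = [i ** (-0.2) for i in range(1, len(samples) + 1)]  # assumed sequence h_i
print(recursive_estimate(0.0, samples, bandwidths), batch_estimate(0.0, samples, bandwidths))
```

The recursive form needs only the previous estimate and the newest observation, which is convenient when data arrive sequentially.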
Wavelets have found many applications in numerical analysis and signal
processing. They have also been used for the estimation of probability densities. Their
applications are usually limited to wavelets of small dimension. In order to deal
with the curse of dimensionality in nonparametric estimation, wavelet networks
have been proposed by Zhang [88].


The problem related to the estimation of the mixture density parameters can be
tackled in many ways. We shall present some techniques and focus our attention on
parametric methods. In what follows, we present an introduction to the maximum
likelihood and expectation maximization approaches. Based on simple examples,
we show how to use these approaches for solving some parameter estimation
problems.


2.6.5. Maximum Likelihood


Many parameter estimation problems have long been formulated as the maximization
of a function called the likelihood. This subsection is devoted to an introduction
of this function and its use for solving some simple identification problems.


The likelihood contains all the information the data can provide us about the
parameters. Let us consider a set of observations (realizations) $x_1, x_2, \ldots, x_N$ with a
distribution function depending on an unknown parameter $\theta$. The likelihood function
of $\theta$ is given by

$$L(x_1, \ldots, x_N; \theta) = \Pr(x_1, \ldots, x_N; \theta).$$

Let us consider a set of N independent and identically distributed samples
$x_1, \ldots, x_N$ with a density function p. The likelihood of the data set is given by

$$L(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n). \tag{2.32}$$

Generally, one wishes to maximize this function.


Example 51 (Maximum of the likelihood function) Consider a Gaussian
variable X with known variance $\sigma^2$. Let us estimate its mean value m. We have

$$\Pr(X = x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x - m)^2}{2\sigma^2}\right).$$

The likelihood function is given by

$$L(x_1, \ldots, x_N; m) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{N}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - m)^2\right).$$

The maximization of this function leads to the minimization of

$$\sum_{i=1}^{N}(x_i - m)^2. \tag{2.33}$$

In view of the properties of the least squares method, we derive the minimum of the
previous term (2.33):

$$\hat{m} = \frac{1}{N}\sum_{i=1}^{N} x_i,$$

which is the commonly used estimator for the mean⁸.


Maximization of (2.32) is equivalent to maximization of the logarithm of the
likelihood function:

$$l(x_1, \ldots, x_N) = \log L(x_1, \ldots, x_N) = \sum_{n=1}^{N}\log p(x_n),$$

or to the minimization of the negative log-likelihood:

$$-l(x_1, \ldots, x_N) = -\sum_{n=1}^{N}\log p(x_n).$$


The following example is borrowed from [24] (Example 13.7.1 in Chapter 13).


Example 52 (Maximum of the log-likelihood) Let us consider n independent
observations of a Poisson random variable X,

$$\Pr(X = k) = \exp(-\lambda)\,\frac{\lambda^{k}}{k!}, \qquad k = 0, 1, 2, \ldots$$

Estimate the expected value of X, i.e., $\lambda$.


8 Observe that the estimation given by the method of moments, which consists of equating
the first r moments to their estimates, leads in this case to the estimate given by the
maximum likelihood estimator. Indeed, we have

$$m = \frac{\sum_{i=1}^{N} x_i}{N} \quad\text{and}\quad \sigma^2 + m^2 = \frac{\sum_{i=1}^{N} x_i^2}{N}.$$

Solving this system yields

$$m = \frac{\sum_{i=1}^{N} x_i}{N} \quad\text{and}\quad \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - m)^2}{N}.$$
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Solution The likelihood function is given by 
maki 
L(x),...,Xn, A) = exp(—A —. 
Qr... ,38,À) = exp( Ts 


The maximum likelihood estimation of the parameter À is obtained by setting the 
derivative of the log-likelihood to zero: 


8log L(x},...,Xn,A) Min 
2 =—n+ = 0, 
which leads to 
ja Sih o Eia, 


n n 
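As a quick numerical check of the Poisson result, the following Python sketch simulates observations and compares the log-likelihood at the sample mean with its value at nearby rates. It is an illustration only, not taken from [24]: the function names, the seed, the sample size and the rate 4.0 are assumptions chosen for the example, and the sampler uses Knuth's multiplication method.

```python
import math
import random


def poisson_log_likelihood(lam, xs):
    """Log-likelihood of i.i.d. Poisson observations xs for rate lam."""
    n = len(xs)
    return (-n * lam + sum(xs) * math.log(lam)
            - sum(math.lgamma(x + 1) for x in xs))


def poisson_mle(xs):
    """Closed-form maximizer of the likelihood: the sample mean."""
    return sum(xs) / len(xs)


def sample_poisson(lam, rng):
    """Draw one Poisson variate by multiplying uniforms (Knuth's method)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(0)  # fixed seed, assumed for reproducibility
xs = [sample_poisson(4.0, rng) for _ in range(2000)]
lam_hat = poisson_mle(xs)
print(lam_hat, poisson_log_likelihood(lam_hat, xs))
```

Evaluating `poisson_log_likelihood` at values slightly above or below `lam_hat` gives smaller numbers, in agreement with the stationarity condition derived in the solution.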


Three important theorems are given in [24]. From these theorems and the Cramer-Rao
(information) inequality [18], it appears that the maximum likelihood estimator
is an asymptotically unbiased (see footnote 9) and asymptotically most efficient estimator of
unknown parameters [24].


The bulk of the remainder of this chapter is devoted to the study of some import- 
ant techniques used for mixture densities estimation. The next section presents 
the expectation maximization (EM) algorithm, which is based on likelihood 
optimization. 


2.6.6. Expectation Maximization 


The expectation maximization (EM) algorithm is an iterative procedure for approximating
maximum-likelihood estimates for mixture density problems [15]. It is a two-step
procedure, consisting of an expectation step (E-step) and a maximization step (M-step),
and it is easily implemented on a computer. We will present two algorithms given in
Vlassis and Likas [79]. This method is commonly used in many
engineering areas, such as reliability. For example, EM has been used by Lim [48]
for estimating the lifetime distribution (two-parameter Weibull distribution) of a
repairable system based on consecutive inter-failure times of the system. The EM
algorithm is stable and robust for finding the maximum likelihood estimator.

9 Let us consider an unknown parameter $\theta$ and its estimate $\theta_n$. The quantity
$E\{\theta_n\} - \theta$
represents the bias. The estimator is said to be unbiased if
$E\{\theta_n\} = \theta$.
The estimator is said to be asymptotically unbiased if
$\lim_{n\to\infty} E\{\theta_n\} = \theta$.


Notice that a random variable is said to follow a Gaussian mixture if its probability
density can be expressed as a finite weighted sum of Gaussian densities, i.e.,

$$f(x) = \sum_{j=1}^{M} \alpha_j f(x \mid j), \tag{2.34}$$

where $f(x \mid j)$ represents the univariate Gaussian $N(m_j, \sigma_j^2)$ parameterized on the
mean $m_j$ and the variance $\sigma_j^2$. We assume that $M$ is a priori given. It is up to the
designer to specify the number of kernels (see footnote 10) [19]. The following constraints

$$\sum_{i=1}^{M} \alpha_i = 1, \qquad \alpha_i > 0, \tag{2.35}$$

ensure that $f(x)$ is a probability density with integral equal to 1 over the input space.
Observe that for $M = 2$, condition (2.35) is equivalent to $\alpha_2 = 1 - \alpha_1$, $\alpha_2 \in (0, 1)$.
It has been shown that the Gaussian mixture model is general and can approxim- 
ate any continuous function that has a finite number of discontinuities [54]. This 
result is not surprising, because many physical processes follow the Gaussian law. 
In other words, the Gaussian law can be considered as the basic element of a 
“Lego” toy. 


Remark 37 The modeling of a distribution function by a Gaussian mixture can 
be, in some sense, compared to the representation of arbitrary functions as a 
combination of “simpler” functions, in order to facilitate the analysis of the 
response of a linear dynamic system to an arbitrary input. 


Let us assume that a training set $X = (x_1,\ldots,x_S)$ of $S$ independent and identically
distributed (iid) data is available. These data take values in $R$. We deal with the
estimation of the $3M$ parameters of the mixture (2.34):

$$\theta = (\alpha_1, m_1, \sigma_1, \ldots, \alpha_M, m_M, \sigma_M).$$


The problem to be solved corresponds to the estimation of the optimal parameter
vector $\theta^*$ that maximizes the likelihood function. It is stated as follows:

$$\theta^* = \arg\max_{\theta} L(\theta), \tag{2.36}$$

where the likelihood function is given by

$$L(\theta) = \prod_{i=1}^{S} f(x_i). \tag{2.37}$$

This optimization problem can be considered as a problem with hidden variables.
The parameter vector $\theta$ is calculated as the value which maximizes the following
function:

$$Q(\theta \mid \theta(n)) = E_Y[\log \Pr(X, Y \mid \theta) \mid X, \theta(n)].$$

10 A kernel on $R^n$ is a bounded, positive function on $R^n$ such that

K(0) » 0, [ «047 [ Poa <c and f KO ar < œ.


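We shall use the EM method to solve (2.36). At each instant $n$, the EM algorithm is given
by the following iterative procedure:

$$\alpha_j(n) = \frac{1}{S}\sum_{i=1}^{S} P(j \mid x_i), \tag{2.38}$$

$$m_j(n) = \frac{\sum_{i=1}^{S} P(j \mid x_i)\, x_i}{\sum_{i=1}^{S} P(j \mid x_i)}, \tag{2.39}$$

$$\sigma_j^2(n) = \frac{\sum_{i=1}^{S} P(j \mid x_i)\,(x_i - m_j(n))^2}{\sum_{i=1}^{S} P(j \mid x_i)}, \tag{2.40}$$

where

$$P(j \mid x_i) = \frac{\alpha_j(n-1)\, f(x_i \mid j, m_j(n-1), \sigma_j(n-1))}{\sum_{k=1}^{M} \alpha_k(n-1)\, f(x_i \mid k, m_k(n-1), \sigma_k(n-1))}. \tag{2.41}$$

From (2.38) and (2.41) it is clear that condition (constraint) (2.35) is satisfied.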
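The iteration (2.38)-(2.41) can be sketched in a few lines of Python. The sketch below is a minimal illustration under stated assumptions, not the implementation used for the simulations of Section 2.7: the function names, the starting values, the seed and the sample size are hypothetical, and the synthetic data mimic the two-component mixture used in Section 2.7.1.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Univariate normal density N(mean, std**2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posteriors(xs, alphas, means, stds):
    """Equation (2.41): P(j | x_i) for every sample i and kernel j."""
    post = []
    for x in xs:
        weighted = [a * gaussian_pdf(x, m, s)
                    for a, m, s in zip(alphas, means, stds)]
        total = sum(weighted)
        post.append([w / total for w in weighted])
    return post


def em_step(xs, alphas, means, stds):
    """One pass of (2.38)-(2.40), starting from the previous parameters."""
    post = posteriors(xs, alphas, means, stds)
    new_alphas, new_means, new_stds = [], [], []
    for j in range(len(alphas)):
        resp = [p[j] for p in post]
        mass = sum(resp)
        mean = sum(r * x for r, x in zip(resp, xs)) / mass
        var = sum(r * (x - mean) ** 2 for r, x in zip(resp, xs)) / mass
        new_alphas.append(mass / len(xs))   # (2.38)
        new_means.append(mean)              # (2.39)
        new_stds.append(math.sqrt(var))     # square root of (2.40)
    return new_alphas, new_means, new_stds


def log_likelihood(xs, alphas, means, stds):
    """Logarithm of the likelihood (2.37)."""
    return sum(math.log(sum(a * gaussian_pdf(x, m, s)
                            for a, m, s in zip(alphas, means, stds)))
               for x in xs)


rng = random.Random(1)  # fixed seed, assumed for reproducibility
xs = [rng.gauss(-3.0, 1.0) if rng.random() < 0.8 else rng.gauss(2.0, 0.5)
      for _ in range(1000)]
params = ([0.5, 0.5], [-1.0, 1.0], [2.0, 2.0])  # hypothetical starting point
history = [log_likelihood(xs, *params)]
for _ in range(50):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
print(params)
```

Each pass can only increase the log-likelihood, so the entries of `history` form a non-decreasing sequence. This monotonicity is a general property of EM, but it gives no protection against convergence to a local optimum.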


The likelihood function is nonlinear and exhibits several local optima, in which the EM
method frequently becomes trapped. The problem of
multimodal function optimization can be tackled in many ways (stochastic approximation
techniques, simulated annealing, genetic algorithms, etc.). Observe also
that the likelihood function of normal mixture models is not bounded above [41].
In fact, if one mixture center coincides with a sample observation and if the corresponding
variance tends to zero, then the likelihood function increases without
bound [8].


In order to avoid the problem related to local optima, and the problem related
to the selection of the number of kernels, Vlassis and Likas [79] have proposed an
algorithm based on the total kurtosis [18], which represents an indicator of how well
a Gaussian mixture fits the data. They modify the number of kernels according to
the total kurtosis. In this method the number of design parameters is increased. This
algorithm is briefly described below.

First, under the assumption that we deal with a Gaussian mixture, the following
equation holds:

$$\int_{-\infty}^{+\infty} \left(\frac{x - m_j}{\sigma_j}\right)^{4} f(x \mid j)\,dx = 3. \tag{2.42}$$

Based on Bayes' rule and using Monte Carlo integration, Vlassis and Likas [79]
define the weighted kurtosis $\kappa_j$ of the kernel $j$ of the mixture by

$$\kappa_j = \frac{\sum_{i=1}^{S} \left((x_i - m_j)/\sigma_j\right)^{4} P(j \mid x_i)}{\sum_{i=1}^{S} P(j \mid x_i)} - 3. \tag{2.43}$$

For a true Gaussian mixture, $\kappa_j$, $j = 1,\ldots,M$, are equal to zero. Clearly, $\kappa_j$,
$j = 1,\ldots,M$, measures the “distance” between the true Gaussian mixture and its
estimate. It represents, in some sense, an indicator of the quality of the estimate.
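Equation (2.43) translates directly into code. The following Python sketch is an illustration under assumptions (function names, sample sizes and seed are hypothetical): it fits a single kernel, for which $P(1 \mid x_i) = 1$, to Gaussian and to uniform samples. The weighted kurtosis should be close to zero in the first case and close to $-1.2$, the excess kurtosis of a uniform law, in the second.

```python
import math
import random


def weighted_kurtosis(xs, resp, mean, std):
    """Equation (2.43): weighted fourth standardized moment minus 3."""
    num = sum(r * ((x - mean) / std) ** 4 for x, r in zip(xs, resp))
    return num / sum(resp) - 3.0


def single_kernel_kurtosis(xs):
    """Fit one kernel (all responsibilities equal to 1) and return kappa."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return weighted_kurtosis(xs, [1.0] * len(xs), mean, std)


rng = random.Random(2)  # fixed seed, assumed for reproducibility
gauss = [rng.gauss(0.0, 1.0) for _ in range(20000)]
unif = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
kappa_gauss = single_kernel_kurtosis(gauss)
kappa_unif = single_kernel_kurtosis(unif)
print(kappa_gauss, kappa_unif)
```

A clearly nonzero value, as for the uniform sample, signals that the kernel does not describe its share of the data well, which is the information the kurtosis-based algorithm uses when deciding whether to split a kernel.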


Other authors [46, 80] have proposed to learn a mixture density by maximum 
likelihood in a greedy fashion, by incrementally adding kernels to the mixture up 
to a desired number of components M. In [1] the authors propose to select M 
using stochastic learning automata, by minimizing the Kullback-Leibler distance 
between a Parzen estimate and the mixture model. Stochastic learning automata 
will be discussed in detail in Chapter 3. It is unclear, though, why a rule-based 
P-model environment response was considered in [1], when it seems apparent to
us that a direct application in the S-model environment would have been more 
interesting. 


Nevertheless, the initial condition can be selected randomly in order to avoid local 
optima, and in practice the number of kernels is relatively small and it can be 
increased or decreased according to the results obtained (this technique looks like 
the algorithm proposed by De Keyser [38] (see also [54]) for the identification 
of the time delay for linear dynamic systems). These simple tools permit one
to overcome the limitations of the EM method and do not introduce new design 
parameters. 


Section 2.7 is dedicated to the illustration of these algorithms via a numerical
example. In the next section we shall be concerned with the use of artificial neural
networks for probability density estimation.


2.6.7. Neural Networks 


The Gaussian mixture approach encounters difficulties in approximating distributions
that are not Gaussian, such as the uniform distribution. Therefore,
it is of interest to consider more general model structures for pdf estimation. In this
subsection, we consider multilayer feedforward neural
networks as probability density estimators.


General Approach 
In what follows, a general approach is considered, as outlined by Likas [47]. Let
us assume the following model structure, where the estimate of the pdf is given by

$$f(x,\theta) = \frac{g(x,\theta)}{\int_S g(z,\theta)\,dz}. \tag{2.44}$$

Here $g(x,\theta)$ is a model containing parameters $\theta_j$ to be estimated. The denominator
ensures the preservation of the probability measure, under the conditions that $g$ is
always greater than or equal to zero and that the integral is strictly positive.


Let us consider maximizing the likelihood $L(\theta) = \prod_{i=1}^{n} f(x_i,\theta)$ of a set of $n$ observation
vectors $x_i$ defined on a compact subset $S \subset R^d$. This problem is equivalent
to minimizing the negative log-likelihood $l(\theta)$, given by

$$l(\theta) = -\sum_{i=1}^{n} \log f(x_i,\theta), \tag{2.45}$$

with respect to the model parameters $\theta$. Using (2.44) we can write this as

$$l(\theta) = -\sum_{i=1}^{n} \log g(x_i,\theta) + n \log I(\theta), \tag{2.46}$$

where we denote $I(\theta) = \int_S g(x,\theta)\,dx$. The integral can be approximated by

$$I(\theta) \approx \sum_{l=1}^{M} a_l\, g(y_l,\theta), \tag{2.47}$$

where $y_l$ are integration points and $a_l$ the corresponding integration weights.

Combining the equations (2.46) and (2.47), we have the following approximation
of the negative log-likelihood:

$$l(\theta) \approx -\sum_{i=1}^{n} \log g(x_i,\theta) + n \log \sum_{l=1}^{M} a_l\, g(y_l,\theta). \tag{2.48}$$




In order to use gradient-based parameter estimation techniques, the gradient of $l(\theta)$
with respect to the parameters in $\theta$ needs to be calculated. Assuming that the gradient
of the network output w.r.t. the parameters is available (which usually is the case for
neural network models, for example), the derivatives are given by

$$\frac{\partial l}{\partial\theta_j}(\theta) = -\sum_{k=1}^{n} \frac{1}{g(x_k,\theta)} \frac{\partial g}{\partial\theta_j}(x_k,\theta) + \frac{n}{\sum_{l=1}^{M} a_l\, g(y_l,\theta)} \sum_{l=1}^{M} a_l \frac{\partial g}{\partial\theta_j}(y_l,\theta). \tag{2.49}$$

Put into words, the minimization of $l(\theta)$ drives the network output to be higher
at the training points $x_k$ and at the same time to be close to zero at the integration
points $y_l$.


In general, the function $g$ is not positive. Using the transformation

$$g(x_k,\theta) = \exp(h(x_k,\theta)),$$

it is clear that the output of $g$ always remains positive. The function $h(x,\theta)$ is now
unconstrained, and the probability measure is ensured by a softmax transformation,
$f = \exp h / \int \exp h$. Noticing that $\partial/\partial z \exp(z) = \exp(z)$, it is simple to modify the
computation of the gradients in (2.49). We have that

$$\frac{\partial g}{\partial\theta_j}(x_k,\theta) = e^{h(x_k,\theta)} \frac{\partial h}{\partial\theta_j}(x_k,\theta). \tag{2.50}$$
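To make the roles of (2.44), (2.48) and the exponential transformation concrete, here is a small Python sketch. It is only an illustration: the toy model $h(x,\theta) = -\theta x^2$ replaces the neural network, and the grid, the rectangle-rule weights $a_l$ and the three data points are hypothetical choices. For this toy model the exact minimizer of the negative log-likelihood is $\theta = n/(2\sum_i x_i^2)$, which equals 0.75 for the data used.

```python
import math


def integral(h, theta, grid, weights):
    """Quadrature (2.47) with g = exp(h)."""
    return sum(a * math.exp(h(y, theta)) for a, y in zip(weights, grid))


def neg_log_likelihood(h, theta, xs, grid, weights):
    """Approximation (2.48); log g(x_i) = h(x_i) because g = exp(h)."""
    return (-sum(h(x, theta) for x in xs)
            + len(xs) * math.log(integral(h, theta, grid, weights)))


def density(h, theta, x, grid, weights):
    """Normalized estimate (2.44), with the integral replaced by (2.47)."""
    return math.exp(h(x, theta)) / integral(h, theta, grid, weights)


def toy_h(x, theta):
    """Hypothetical stand-in for the network output: h = -theta * x^2."""
    return -theta * x * x


M = 801                                               # number of grid points
grid = [-8.0 + 16.0 * l / (M - 1) for l in range(M)]  # equidistant on [-8, 8]
weights = [16.0 / (M - 1)] * M                        # rectangle-rule weights
xs = [-1.0, 0.0, 1.0]                                 # toy observations
f0 = density(toy_h, 0.5, 0.0, grid, weights)
nll = {t: neg_log_likelihood(toy_h, t, xs, grid, weights)
       for t in (0.5, 0.75, 1.0)}
print(f0, nll)
```

With $\theta = 0.5$ the normalized estimate is the standard normal density, so `f0` should agree with $1/\sqrt{2\pi}$, and among the three values tried the negative log-likelihood is smallest at 0.75.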


Sigmoid Neural Networks 

Neural networks consist of multiple techniques related loosely to each other by 
the background of the algorithms: the neural circuitry in a living brain. Neural 
networks have attracted the attention of scientists and technologists from a number 
of disciplines. Biologists and psychologists look at (artificial) neural networks as 
possible prototype structures of human-like information processing (in an attempt 
to model the brain circuitry and its behavior). Computer scientists are interested 
in opportunities that are offered by massively parallel computational networks. 
For engineers, the possibility of estimating the parameters of the network using 
(noisy) samples of input-output behavior is of great benefit. These computational 
advantages, combined with the flexibility of black-box identification, make the 
neural network approaches appealing in the fields of engineering, economics, etc. 


One of the most common neural model structures used for function approximation
is the sigmoid neural network (SNN). A typical structure is that of a layered feedforward
SNN. In such a network, the network topology can be seen as consisting
of layers of units (neurons), each of which performs a sigmoid function

$$\sigma(z) = \frac{1}{1 + \exp(-z)}, \tag{2.51}$$

where $z$ is a weighted linear combination of the input signals received from the
previous layer.


For simplicity of notation, let us omit the sample indexes for a moment (we consider
static functions $h(x,\theta)$ only, where $x$ is a column vector). The output of a
one-hidden-layer sigmoid neural network with $H$ hidden nodes is given by [34]:

$$h_{SNN}(x,\alpha,\beta) = \sum_{h=1}^{H} \alpha_h\, q_h(x,\beta_h) + \alpha_{H+1}, \tag{2.52}$$

where the basis functions $q_h$ are given by the sigmoid function:

$$q_h(x,\beta_h) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{I} \beta_{h,i}\, x_i - \beta_{h,I+1}\right)}, \tag{2.53}$$

and $x$ is the $I$-dimensional input vector. The parameters of the network are contained
in an $(H + 1)$-dimensional vector $\alpha$ and an $H \times (I + 1)$-dimensional matrix $\beta$.


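The parameters (weights) of the network are estimated from sampled data, typically
using gradient-based techniques. The gradients with respect to the parameters are
given by

$$\frac{\partial h_{SNN}}{\partial\alpha_h} = q_h(x,\beta_h), \qquad \frac{\partial h_{SNN}}{\partial\alpha_{H+1}} = 1, \tag{2.54}$$

$$\frac{\partial h_{SNN}}{\partial\beta_{h,i}} = \alpha_h\, q_h(x,\beta_h)\,(1 - q_h(x,\beta_h))\, x_i, \tag{2.55}$$

$$\frac{\partial h_{SNN}}{\partial\beta_{h,I+1}} = \alpha_h\, q_h(x,\beta_h)\,(1 - q_h(x,\beta_h)), \tag{2.56}$$

where $h = 1, 2, \ldots, H$ and $i = 1, 2, \ldots, I$.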
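The forward pass (2.52)-(2.53) and the gradients (2.54)-(2.56) are short enough to write out explicitly. The Python sketch below is an illustration with hypothetical function names and numerical values; the parameters are stored as plain lists (`alpha` of length H + 1, `beta` as H rows of length I + 1). Comparing the analytic gradient with a finite difference, as done at the end, is a convenient way to check formulas of this kind.

```python
import math


def snn_forward(x, alpha, beta):
    """Equations (2.52)-(2.53): network output and hidden activations q_h."""
    q = []
    for row in beta:  # row = (beta_h1, ..., beta_hI, beta_h,I+1)
        z = sum(b * xi for b, xi in zip(row[:-1], x)) + row[-1]
        q.append(1.0 / (1.0 + math.exp(-z)))
    out = sum(a * qh for a, qh in zip(alpha[:-1], q)) + alpha[-1]
    return out, q


def snn_gradients(x, alpha, beta):
    """Equations (2.54)-(2.56): gradients w.r.t. alpha and beta."""
    _, q = snn_forward(x, alpha, beta)
    grad_alpha = q + [1.0]                       # (2.54)
    grad_beta = []
    for a, qh in zip(alpha[:-1], q):
        common = a * qh * (1.0 - qh)
        grad_beta.append([common * xi for xi in x] + [common])  # (2.55), (2.56)
    return grad_alpha, grad_beta


# Hypothetical network with I = 2 inputs and H = 2 hidden nodes.
alpha = [0.7, -1.2, 0.3]
beta = [[0.5, -0.4, 0.1], [1.5, 0.2, -0.3]]
x = [0.8, -1.1]
grad_alpha, grad_beta = snn_gradients(x, alpha, beta)

# Finite-difference check of the derivative with respect to beta_{1,2}.
eps = 1e-6
base, _ = snn_forward(x, alpha, beta)
bumped = [row[:] for row in beta]
bumped[0][1] += eps
shifted, _ = snn_forward(x, alpha, bumped)
print((shifted - base) / eps, grad_beta[0][1])
```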


Let us collect all the parameters $\alpha_h$ and $\beta_{h,i}$ (to be updated) in a column vector $\theta$.
A simple parameter update scheme can be obtained by moving the parameters
iteratively in the direction of the negative gradient of a cost function $J$:

$$\theta_j(k+1) = \theta_j(k) - \gamma \frac{\partial J}{\partial\theta_j}(k) = \theta_j(k) + \gamma \frac{\partial h_{SNN}}{\partial\theta_j}(k)\,[y(k) - \hat y(k)],$$

where $y(k)$ is the desired output of the network at iteration $k$, $\hat y(k)$ is the
network output, and $\gamma$ is the learning rate; the second equality corresponds to the
squared-error cost $J = \frac{1}{2}[y(k) - \hat y(k)]^2$. In the context of neural networks, this algorithm
is referred to as the backpropagation algorithm. It has been shown by White [83] that it is an
application of the Robbins and Monro [63] procedure, namely the stochastic approximation
technique, in the context of neural network learning.


For data sets of moderate size, more efficient techniques can be applied. A commonly
used second-order trust region method is the Levenberg-Marquardt method
(see for example [34]). It is given by the following iterative procedure:

$$\theta(k+1) = \theta(k) - [G^T G + \mu I]^{-1} G^T R, \qquad \mu \ge 0,$$

where $G$ is the Jacobian with elements $G_{i,j} = \partial h_{SNN}/\partial\theta_j\,(x_i)\big|_{\theta=\theta(k)}$ (the gradient of
the SNN output with respect to the $j$th parameter in $\theta$, for input data vector $x_i$,
$i = 1, 2, \ldots, N$, with the network evaluated using the previous parameters, from
iteration $k$), and $\mu$ is a regularizing parameter. The column vector $R$ contains the
errors between the network output and the corresponding targets. In the Levenberg-Marquardt
method the parameter $\mu$ is increased whenever the step results in an
increased value of $J = R^T R$, and reduced otherwise. In the context of unconstrained
minimization, this type of algorithm is referred to as a damped Newton
method.


Pdf Estimation Using SNN 

In Likas [47], the model $g(x,\theta)$ is a sigmoid neural network function with an
exponential output node, $\exp(h_{SNN}(x,\theta))$. The network parameters need to be
estimated using data. Two “tricks” are suggested by Likas:

• In order to ensure that the pdf is zero outside the input domain $S$ (or, at least, at
the boundaries of $S$), Likas suggests initializing the network (see [47]) by using
data obtained from a non-parametric (Rosenblatt-Parzen) estimator of the pdf:

$$\hat p(x) = \frac{1}{n(\sigma\sqrt{2\pi})^{d}} \sum_{i=1}^{n} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \tag{2.57}$$

with an adjustable width parameter $\sigma$. The initial parameters $\theta(0)$ of the network
are then obtained by minimizing the squared error

$$E(\theta) = \sum_{l=1}^{M} [g(y_l,\theta) - \hat p(y_l)]^2 \tag{2.58}$$

over the $M$ integration points, where the gradient of $g$ is obtained using (2.54)-(2.56)
and (2.50).

• The selection of integration points $y_l$ is a delicate issue due to the numerical
integration involved with this approach. Likas [47] ends up proposing a moving
grid method, where the coordinates $y_{li}$ of the integration points are given by

$$y_{li}(t) = y_{li}(0) + d_i(t),$$

where $t$ is the iteration index and $d_i$ is a displacement. The displacement
is selected randomly from $U(-h_i/2, h_i/2)$, where $h_i$ sets the density of the
integration grid.
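Both ingredients of the initialization, the Parzen estimate (2.57) and the moving grid, can be sketched in Python as follows. This is an illustration under assumptions rather than the code of [47]: the function names are hypothetical, points are stored as lists of coordinates, and the moving-grid rule is read as one random displacement $d_i$ per coordinate, applied to every initial grid point.

```python
import math
import random


def parzen_estimate(y, data, sigma):
    """Equation (2.57) evaluated at the d-dimensional point y."""
    d = len(y)
    norm = len(data) * (sigma * math.sqrt(2.0 * math.pi)) ** d
    total = 0.0
    for x in data:
        sq_dist = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / norm


def moved_grid(grid0, spacing, rng):
    """Shift the initial grid by d_i drawn from U(-h_i/2, h_i/2) per coordinate."""
    shift = [rng.uniform(-h / 2.0, h / 2.0) for h in spacing]
    return [[yi + di for yi, di in zip(y, shift)] for y in grid0]


rng = random.Random(3)                      # fixed seed, assumed
data = [[rng.gauss(0.0, 1.0)] for _ in range(500)]
grid0 = [[-4.0 + 0.5 * l] for l in range(17)]   # equidistant points, h = 0.5
grid_t = moved_grid(grid0, [0.5], rng)
targets = [parzen_estimate(y, data, 0.3) for y in grid_t]
print(targets[:3])
```

The values in `targets` play the role of $\hat p(y_l)$ in the squared error (2.58).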


Simulation results of the approach are given in the next section. 


2.7. Numerical Examples 


This section presents some numerical examples. First, the basic EM algorithm 
is examined, which adjusts a given number of user-initialized Gaussian kernels 
to fit the data distribution. This is followed by two simulation studies on the 
kurtosis-based EM algorithm, where the kernels are added one by one. Finally, 
a sigmoid neural network is used for approximating an unknown pdf based on 
observed data. 


2.7.1. EM algorithm 


The EM algorithm is simple to implement according to the given formulas (see the
previous sections). In order to determine the number of iterations, the following
termination criterion was used:

$$L_i - L_{i-1} < \epsilon_L,$$

where $\epsilon_L = \epsilon|L_0|$ and $\epsilon$ is a user-defined parameter, $i$ is the iteration (EM-step)
counter, and $L_0$ is the log-likelihood obtained using the initial distribution of the
Gaussians. The Gaussians were initialized such that the means $\mu_j$ were uniformly
distributed in the range of data $X$, with deviations $\sigma_j = \mathrm{std}(X)/K$ and equal prior
probabilities.


A data set of 5000 points was generated using $0.8N(-3, 1^2) + 0.2N(2, 0.5^2)$.
Approximation of this mixture of two known normals is shown in Figure 2.18.
The upper plot shows the original and initial probability density functions, the
lower plot shows the density estimated with the EM algorithm. The estimate follows
the original pdf very closely. This was expected, as the original is composed
of two Gaussians, which are also the kernels used by the EM algorithm.

Figure 2.18 Probability density distributions. Solid line: original distribution.
Dotted line: estimated distribution


2.7.2. Kurtosis-based EM algorithm 


The essence of the kurtosis-based algorithm suggested by Vlassis and Likas [79]
is implemented by the following pseudo-code. Let the data be given in an input
vector X:

Set user-defined parameters (ε, ξ, I_min) and initialize algorithm
    (SET s = 1, i = 0, K = 1, μ_1 = mean(X), σ_1 = std(X), π_1 = 1).
Compute K_T and L and set convergence parameters
    (SET ε_L = ε|L| and ε_K = ε|K_T| and SIU = TRUE).
WHILE SIU
    EMC = FALSE
    WHILE NOT EMC
        i = i + 1; L^old = L and K_T^old = K_T
        EM-step (SET μ_j, σ_j, π_j, K, L and K_T)
        Examine if EM-iterations have converged (SET EMC)
    END
    Store values of L and K_T ([L]_s = L, [K_T]_s = K_T)
    Examine if previous split was useful (SET SIU)
    SET s = s + 1, i = 0
    Split (SET μ_j, σ_j, π_j, K, L and K_T)
END
Return to previous solution at s - 1
    (SET μ_j, σ_j, π_j, K, L and K_T).




The algorithm contains options and choices to be made by the user. The essence of
the algorithm implementation is described in what follows:

• The user-defined variables include $\epsilon$ and $\xi$ (0.0005 and 0.005), which control
the criteria used for detecting convergence and efficiency of splitting, and $I_{min}$,
which is a minimum number of EM steps required after each splitting.

• The EMC is a flag indicating if the EM iterations have converged. EMC is
considered to be TRUE when $i > I_{min}$, $|L - L^{old}| < \epsilon_L$ and $|K_T - K_T^{old}| < \epsilon_K$,
otherwise FALSE.

• The SIU is a flag indicating the usefulness of the previous split. Once it is set
to FALSE, no further splittings are conducted. Splitting is enabled when $s = 1$,
$L_s - L_{s-1} > \xi_L$, or ($[K_T]_s - [K_T]_{s-1} < \xi_K$ and $L_s - L_{s-1} > 0$). Once
neither of the two conditions holds, SIU is set to FALSE.

• The split is conducted by finding a potential kernel for splitting
($j = \arg\max_j \pi_j|\kappa_j|$). Denote the mean, standard deviation and prior probability of this kernel
by $\mu^*$, $\sigma^*$ and $\pi^*$, respectively. The new kernels are then set as $\mu_j = \mu^* - \sigma^*$,
$\mu_{K+1} = \mu^* + \sigma^*$, $\sigma_j = \sigma_{K+1} = \sigma^*$ and $\pi_j = \pi_{K+1} = \pi^*/2$.
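The split rule in the last item can be written as a small Python function. This is a sketch with a hypothetical function name and hypothetical numerical values; it assumes that the spread parameter of each kernel is stored as a standard deviation, as in the initialization $\sigma_1 = \mathrm{std}(X)$ of the pseudo-code, and that the weighted kurtoses $\kappa_j$ have already been computed.

```python
def split_kernel(means, stds, priors, kappas):
    """Split the kernel with the largest pi_j * |kappa_j| into two kernels."""
    j = max(range(len(means)), key=lambda k: priors[k] * abs(kappas[k]))
    mu, sigma, pi = means[j], stds[j], priors[j]
    new_means, new_stds, new_priors = list(means), list(stds), list(priors)
    new_means[j] = mu - sigma          # first child replaces kernel j
    new_priors[j] = pi / 2.0
    new_means.append(mu + sigma)       # second child becomes kernel K + 1
    new_stds.append(sigma)
    new_priors.append(pi / 2.0)
    return new_means, new_stds, new_priors


# Hypothetical two-kernel mixture; the second kernel has the larger score.
means, stds, priors = split_kernel([0.0, 5.0], [1.0, 2.0],
                                   [0.6, 0.4], [0.1, -1.0])
print(means, stds, priors)
```

Because the prior of the selected kernel is shared equally between its two children, the priors still sum to one after the split, so constraint (2.35) is preserved.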


Figure 2.19 shows the approximation of the mixture of two normals using the 
kurtosis-based EM algorithm. The three upper plots show the evolution of the 
number of kernels (K), the kurtosis measure (KT) and the log-likelihood (L) as 
a function of EM steps. The lower plot shows the estimated mixture probability 
density function. Again, the estimate follows very closely the original pdf. 


In a second example, the Erlang(3, 2) distribution was approximated using a Gaussian
mixture. Figure 2.20 shows the evolution of estimation using the kurtosis-based
EM algorithm and the estimated probability density function. The log-likelihood
obtained using the kurtosis-based algorithm ($L = -4378$) is slightly better than that
obtained using the EM algorithm ($K = 8$, $L = -4402$), which can be attributed
to the initialization of the kernels.


These simulations are indicative of results that can be obtained in real situations, and
demonstrate the efficacy of the EM technique and kurtosis-based EM algorithm.
Notice that based on the Kolmogorov efficiency criterion (see Section 2.8), we
obtain the following results:

Simulation   d        λ       Pr(λ)
1            0.0931   6.583   0.00
2            0.0562   3.973   0.00
3            0.0060   0.424   0.99
4            0.0060   0.424   0.99

These results show that the kurtosis-based EM algorithm is more efficient than the
EM algorithm. Nevertheless, based on Remark 38 (page 157), we shall take into
account only the good representation as shown by Figures 2.18-2.20.

Figure 2.19 Mixture of two normal distributions. Upper plots: Evolution of the
kurtosis-based EM algorithm. Lower plot: Probability density distributions. Solid
line: original distribution. Dotted line: estimated distribution


2.7.3. Maximum Likelihood Estimation Using SNN 


In this subsection, numerical examples of the method suggested by Likas [47] are 
provided. The method is implemented by the following algorithm: 


1. Create a Parzen estimator of the pdf:
   (a) Select the smoothing parameter $\sigma$.
   (b) Form a Parzen estimate $\hat p(x)$ for points $Y$, using data $X$.
2. Initialize the parameters of a sigmoid neural network (SNN):
   (a) Select the number of hidden nodes $H$.
   (b) Set $i = 0$, initialize the network parameters $\alpha$ and $\beta$, and set the initial search
       step size.
   (c) Until $\Delta J < \epsilon J_0$ or $i > i_{max}$, improve $\alpha$ and $\beta$ iteratively:
       i. Set $i = i + 1$.
       ii. Evaluate the SNN output, its gradients, and the cost function $J$ (sum of
           squared errors between the SNN output and the corresponding Parzen
           estimator outputs).
       iii. Compute new estimates of $\alpha$ and $\beta$ (using gradient-based techniques,
            e.g., Levenberg-Marquardt).
3. Estimate the parameters of the SNN by minimizing the negative log-likelihood:
   (a) Set $i = 0$ and the initial search step size.
   (b) Until $\Delta L < \epsilon L_0$ or $i > i_{max}$, improve $\alpha$ and $\beta$ iteratively:
       i. Set $i = i + 1$.
       ii. Select randomly the displacement of the grid $Y$.
       iii. Evaluate the SNN outputs and the gradients, the integral $I$, and the
            negative log-likelihood $l$ and its gradient.
       iv. Compute new estimates of $\alpha$ and $\beta$ (using gradient-based techniques,
           e.g., damped Newton).

Figure 2.20 Erlang distribution. Upper plots: Evolution of the kurtosis-based
EM algorithm. Lower plot: Probability density distributions. Solid line: original
distribution. Dotted line: distribution estimated using the kurtosis-based EM algorithm.
Dash-dot line: distribution estimated using the EM algorithm


In what follows, we will consider two numerical examples.

Mixture of two Gaussians and two uniform kernels. The first example is taken
from Likas [47], where 5000 samples were generated from a pdf consisting of
a mixture of two Gaussians and two uniform kernels:

$$p(x) = \frac{1}{4} N\left(-7, \frac{1}{2}\right) + \frac{1}{4} U(-3, -1) + \frac{1}{4} U(1, 3) + \frac{1}{4} N\left(7, \frac{1}{2}\right).$$

The integration points $Y$ were created by selecting $M = 100$ and a one-dimensional
grid with equidistant points in $[-12, 12]$. For the Parzen estimator, we selected
$\sigma = (1/1000)(\max Y - \min Y)$. Figure 2.21 (top plot) shows the predictions by
the Parzen estimator $g_{Parzen}$. The given parameters result in a rough estimate of the
pdf with a log-likelihood of $-10\,363$.


For the SNN, a network topology with $H = 10$ was considered to be sufficient.
Parameters $\alpha$ and $\beta$ were initialized to small random values. The SNN parameters
were then trained so as to minimize

$$J = \sum_{l=1}^{M} \left(g_{Parzen}(y_l, X, \sigma) - g_{SNN}(y_l, \alpha, \beta)\right)^2,$$

using the Levenberg-Marquardt method. Fitting the 10 sigmoid functions to mimic
the output of the Parzen estimator at points in $Y$ gives a smoother estimate (see
Figure 2.21, middle plot), but it still contains spurious regions. For data in $X$, a
log-likelihood of $l_{SNN-PE} = -10\,677$ was obtained.

Figure 2.21 Estimated pdf's for a mixture of two Gaussians and two uniform distributions

Figure 2.22 Estimated and true pdf's for the Erlang distribution


Finally, the negative log-likelihood was minimized using both X (samples) and 
Y (integration points), using the damped Newton method (a Levenberg-Marquardt
type of method for unconstrained minimization). The pdf resulting after 10° iter- 
ations is shown in Figure 2.21 (bottom plot). As a result of training using the 
maximum likelihood method, the SNN pdf estimator evens the heights of the two 
Gaussians. Predictions at the regions where uniform distributions dominate are also 
improved. Also, the regions with pdf close to zero are improved during the ML 
estimation (not distinguishable from Figure 2.21, however). The log-likelihood on
$X$ improved to $l_{SNN-ML} = -10\,527$.


Clearly, the last of the three estimates showed the best performance. This is also
reflected by the average log-likelihoods of the Parzen, SNN-PE and SNN-ML
estimates on 100 sets of independent data generated from $p(x)$: $l_{Parzen} = -10\,719$,
$l_{SNN-PE} = -10\,893$ and $l_{SNN-ML} = -10\,660$. These should be compared to the
average log-likelihood on the true pdf, $l = -10\,482$. Notice that the Parzen estimator
gives better average performance than the SNN-PE estimator, which suggests
that 1) the SNN-PE estimator picked up “noise” present in the Parzen estimator
predictions; 2) the smoothing factor of the Parzen estimator was well chosen.

Erlang distribution. The second numerical example considers the Erlang(3, 2)
distribution (see Section 2.7.2). An SNN network with five hidden nodes was
considered. Otherwise, the parameters were selected as in the first example.


In the simulations (see Figure 2.22), the following log-likelihoods on data $X$ were
obtained: $l_{Parzen} = -4106$, $l_{SNN-PE} = -4371$, $l_{SNN-ML} = -4357$; the
log-likelihood on the true pdf was $-4357$. This would indicate excellent results.


Unfortunately, it appears that this illustrates only success in minimizing the negative
log-likelihood under the given data $X$ and integration grid $Y$. Instead, the average
log-likelihoods on 100 independently generated test sets gave $l_{Parzen} = -\infty$,
$l_{SNN-PE} = -4426$, $l_{SNN-ML} = -4420$ and $l = -4399$ on the true pdf. For several
sets of data generated under the Erlang distribution, the Parzen estimator gave zero
likelihood for the data set to occur (i.e., negative infinite log-likelihood). This
was due to the choice of too small a smoothing parameter $\sigma$. Again, the
SNN-PE estimator contains some spurious curves (around the peak of the Erlang
pdf, see Figure 2.22), and the ML estimation is able to damp the size of these peaks
significantly.


2.8. Model Validation 


Model validation is a crucial issue in system identification. It is related to the 
problem of assessing the quality of a given, or estimated model. The classical way 
of validating models is the standard cross correlation test between residuals and 
input (regressor). 


Do not forget that the notion of “good” model is intimately related to the application 
itself (the purpose, the objective to be achieved, for which the model is developed), 
i.e., a model may be good for one application and not good for another application. 
Model validation depends on the prior knowledge of the process to be modelled, 
and it is intimately related to the adopted conformity criterion. 


In the framework of probability density estimation, we encounter the same
problem, but here one question arises: is the difference between the theoretical probability
density and the estimated one due to the limited number of data (realizations), or is this
difference related to the fact that the adopted probability density function is not adequate?

There exist many criteria (tests), such as the $\chi^2$ law and the Kolmogorov criterion,
that permit the estimation of the degree of conformity (validity of the selected
probability density model) [78].


2.8.1. The $\chi^2$ Approach


Let $X_1,\ldots,X_m$ be independent $N(0, 1)$ random variables. Then the sum of their squares, i.e.,

$$\sum_{i=1}^{m} X_i^2,$$

is distributed according to the chi-squared distribution (Pearson) law with $m$ degrees
of freedom [53]. The probability density is given by

$$p(x) = \begin{cases} 0 & \text{for } x \le 0, \\ \dfrac{1}{2^{m/2}\,\Gamma(m/2)}\, x^{m/2-1} \exp\left(-\dfrac{x}{2}\right) & \text{for } x > 0, \end{cases}$$

where

$$\Gamma(x) = \int_0^{\infty} t^{x-1} \exp(-t)\,dt.$$

The $\chi^2$ distribution is said to be central (see footnote 11) [45]. The noncentral $\chi^2$
distribution is the generalization of this when the means are different from zero. The
characteristic function of the central distribution is given by

$$\phi(t) = (1 - 2it)^{-n/2}$$

for $\chi^2$ with $n$ degrees of freedom. It has been shown that the random variable

$$\frac{\chi^2 - n}{\sqrt{2n}}$$

is asymptotically normal. A more precise approximation due to Fisher [23]
shows that

$$Y_n = \sqrt{2\chi^2} - \sqrt{2n - 1}$$

is approximately standard normal.


The distribution of the continuous $\chi^2$ was discovered by Bienaymé [7] and Helmert
[33] during their studies of least squares. Consequently, Pearson's derivation of $\chi^2$
can be seen as a development out of the least squares theory [45]. The $\chi^2$ test is
described in what follows.

Let us consider a set of $r$ intervals (classes) $(x_i, x_{i+1})$ and $n$ independent trials.
The probability $p_i^*$ that the random variable $x$ belongs to the interval $(x_i, x_{i+1})$ can
be estimated as the frequency of the outcome $x \in (x_i, x_{i+1})$. The results of these
trials can be summarized in the following table:

T_i    | (x_1, x_2)  (x_2, x_3)  ...  (x_{r-1}, x_r)  (x_r, x_{r+1})
p_i^*  | p_1^*       p_2^*       ...  p_{r-1}^*       p_r^*

11 A noncentral $\chi^2$ distribution with $n$ degrees of freedom can be decomposed into a sum
of a noncentral $\chi^2$ with one degree of freedom and the same parameter, and a central $\chi^2$ with
$(n - 1)$ degrees of freedom, where the two variables involved are mutually independent.


If the probability distribution law is a priori (theoretically) known, the theoretical values
of the probabilities $p_1,\ldots,p_r$ can be calculated. In order to adopt or to reject the
model used for representing the unknown probability distribution, the weighted
quadratic sum of the errors $e_i = (p_i^* - p_i)$, i.e.

$$J = \sum_{i=1}^{r} w_i (p_i^* - p_i)^2,$$

will be adopted as a model validation criterion. The weighting factors $w_i$
($i = 1,\ldots,r$) must be selected according to the value of the probability $p_i$. Indeed, if,
for example, $p_i$ is close to 1 (big), the error $e_i$ presents less importance than if $p_i$
is close to 0 (small). Pearson has selected the weighting factor $w_i$ as follows:

$$w_i = \frac{n}{p_i}.$$

The conformity criterion then becomes

$$J = n \sum_{i=1}^{r} \frac{(p_i^* - p_i)^2}{p_i}, \tag{2.59}$$

which corresponds to what is called the chi-square criterion and is denoted by $\chi^2$.
This criterion, which is used in order to measure the deviation between $p_i^*$ and $p_i$,
does not correspond to a distance in the strict sense. Indeed, it is easy to see that
$\chi^2(p^* - p) \ge 0$ and equals zero if and only if $p_i^* = p_i$ ($i = 1,\ldots,r$). However,
the triangular inequality is not satisfied in general. The $\chi^2$ conformity criterion can
be written in a more convenient and easy-to-use form:

$$J = \sum_{i=1}^{r} \frac{(n p_i^* - n p_i)^2}{n p_i} = \sum_{i=1}^{r} \frac{(m_i - n p_i)^2}{n p_i}, \tag{2.60}$$

where $m_i = n p_i^*$ is the number of trials whose outcome falls in the $i$th interval.

The implementation of the $\chi^2$ conformity criterion is summarized in the following
three steps:

1. Compute the value of the $\chi^2$ conformity criterion according to (2.60).
2. Determine the number of degrees of freedom $r$ ($r$ is equal to the number of
intervals minus the number of constraints such as $\sum_i p_i^* = 1$).
3. Determine from the tables of the $\chi^2$ distribution the probability that a given random
variable distributed according to the $\chi^2$ law is greater than the value of the
$\chi^2$ conformity criterion computed in Step 1. If this probability is relatively
small, the theoretical model must be rejected; otherwise the adopted model
(probability distribution function) represents correctly the distribution of the
concerned phenomenon (available data).
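
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names and the die-throwing data are hypothetical, and the table lookup of Step 3 is replaced by a direct evaluation of the tail probability through the standard recurrence for the regularized incomplete gamma function.

```python
import math

def chi_square_statistic(counts, probs):
    """Criterion (2.60): sum of (m_i - n p_i)^2 / (n p_i) over the intervals."""
    n = sum(counts)
    return sum((m - n * p) ** 2 / (n * p) for m, p in zip(counts, probs))

def chi_square_tail(x, dof):
    """Probability that a chi-square variable with `dof` degrees of freedom exceeds x."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-half)              # closed form for dof = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # closed form for dof = 1
    while 2.0 * a < dof:
        # Recurrence Q(a + 1, x/2) = Q(a, x/2) + (x/2)^a exp(-x/2) / Gamma(a + 1)
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical data: 60 throws of a die, tested against the uniform law.
counts = [8, 12, 10, 9, 11, 10]
probs = [1.0 / 6.0] * 6
statistic = chi_square_statistic(counts, probs)   # Step 1
dof = len(counts) - 1                             # Step 2: one constraint, sum p_i^* = 1
p_value = chi_square_tail(statistic, dof)         # Step 3
```

A large tail probability, as in this example, gives no reason to reject the model; the threshold below which the probability counts as "relatively small" (often 0.05) remains the user's choice.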


The $\chi^2$ goodness-of-fit test provides an overall test of differences between observed
and expected frequencies specified by a model. Sometimes it is of interest to examine
whether the departure from the expected is due to deficiency or otherwise of
the observed frequency in a particular cell [60].

We conclude this development by recalling that Greenwood and Nikulin [29] said:
“the $\chi^2$ testing remain an art”. In the next subsection we present a validation test
due to Kolmogorov.


2.8.2. Kolmogorov Criterion 


In this subsection, we shall present the Kolmogorov validation test (criterion). Let us
denote by $F(x)$ and $\hat{F}(x)$ the theoretical probability distribution and its estimation
(model), respectively. The conformity criterion adopted by Kolmogorov is

$$d = \max_x |\hat{F}(x) - F(x)|.$$

This criterion is very simple and can be easily calculated.

Kolmogorov showed that if the number $M$ of independent observations increases,
the probability of the inequality

$$d\sqrt{M} \ge \lambda$$

tends to

$$\Pr(\lambda) = 1 - \sum_{k=-\infty}^{+\infty} (-1)^k \exp(-2k^2\lambda^2). \tag{2.61}$$


The implementation of the Kolmogorov conformity criterion is summarized in what
follows:

1. Calculate $d$.
2. Determine the value of $\lambda$ ($\lambda = d\sqrt{M}$).
3. Compute the probability $\Pr(\lambda)$ using (2.61). If $\Pr(\lambda)$ is relatively large, it means
that the model is good (the random variable considered is distributed according
to $F(x)$); otherwise the model must be rejected.

Remark 38 The Kolmogorov criterion is mainly applicable if knowledge about
the probability distribution function is a priori available.
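
The Kolmogorov procedure can be illustrated by the following sketch. It assumes that one of the two distributions being compared is the empirical distribution function of the $M$ observations; the function names and the truncation rule for the series (2.61) are illustrative choices, not part of the original text.

```python
import math

def kolmogorov_distance(sample, cdf):
    """d = max_x |F_hat(x) - F(x)|, with F_hat the empirical distribution of the sample."""
    xs = sorted(sample)
    m = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical distribution jumps from i/m to (i + 1)/m at x.
        d = max(d, abs((i + 1) / m - f), abs(i / m - f))
    return d

def kolmogorov_tail(lam, tol=1e-12, max_terms=10000):
    """Limit (2.61), rewritten as 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)."""
    if lam <= 0.0:
        return 1.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += (-1) ** (k - 1) * term
        if term < tol:
            break
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical sample tested against the uniform law on [0, 1].
sample = [0.1, 0.4, 0.7]
d = kolmogorov_distance(sample, lambda x: x)   # Step 1
lam = d * math.sqrt(len(sample))               # Step 2
prob = kolmogorov_tail(lam)                    # Step 3
```

Since (2.61) is an asymptotic result, the value obtained for such a small sample is only indicative.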


Rao [60] says: “No consistent theory of testing statistical hypothesis exists from 
which all tests of significance can be deduced as acceptable solutions. In many 
situations, test criteria may have to be obtained from intuitive considerations.” 


Indeed, in most practical problems there is some prior knowledge available that 
gives guidelines on how the model of a probability distribution should be chosen. 
In situations when a very large amount of data is available, problems related to the
specification of the distribution law and to the associated optimization problem may
occur. In these situations, we can be inspired by the technique used for nonlinear
process modeling, namely “Model-On-Demand” [9, 10]. This technique is based
on ideas from local modeling and database systems technology. The model is built
on “demand”, as the actual need arises. In general, we believe that many techniques
developed in the framework of automatic control can be adapted to the development
of probability distribution models.


The bulk of the remainder of this chapter is devoted to the use of stochastic 
approximation techniques in probability density estimation. 


2.9. Stochastic Approximation 


A number of problems of engineering and the like reduce to determining an optimal
distribution $f^*(x)$ for distributing a limited quantity of resources $x$ from a given
set $X$, so that the criterion

$$I = \int_X \Phi(f^*(x), x)\, dP(x) = E\{\Phi(f^*(x), x)\}, \tag{2.62}$$

where $P(x)$ is the distribution of the cost function $\Phi(f^*(x), x)$, attains an extremum
(maximum or minimum). Here the condition

$$\int_{x \in X_0} f^*(x)\, dx = \int_X \chi_{X_0}(x) f^*(x)\, dx = \alpha, \tag{2.63}$$

which characterizes the constraints (resources, etc.), must be satisfied. Here $\chi_{X_0}(x)$
represents the characteristic function of the set $X_0$, i.e.,

$$\chi_{X_0}(x) = \begin{cases} 1 & \text{if } x \in X_0 \\ 0 & \text{if } x \notin X_0, \end{cases}$$

and $\alpha \in [0, 1]$ is a desirable informative level.


Many problems in reliability theory can be stated in the following manner. Assume
that we are given some system and let $A$ be a finite set of different components
connected in some way. Any element $a \in A$ may be in one of two states: in
the operable state or in a failed state, depending on a variety of random factors.
The probability $F(g(a) < x)$ that the system is in the state $x \in X$ depends on the
various parameters of the components $a \in A$; as a result, $g(a)$ is a vector whose
components are, for example, cost $g_1(a)$, weight $g_2(a)$, volume $g_3(a)$, number
$g_4(a)$ of reserve units of type $a$, etc.


If in a discrete case we let $p(x)$ denote the probability that the system is in state $x$, it is
possible to solve a chosen problem randomly (independently of $x$), in accordance
with some law of probability in the set of problems that has been defined. In a
general case, $p(x)$ may be the efficiency of the system. For the quality criterion of
the system reliability it is natural to choose the quantity

$$I = \sum_{x \in X} p(x)\, \Phi(g(x), x) = E\{\Phi(g(x), x)\}. \tag{2.64}$$


In this case, the problem of constructing a system that is optimal from the viewpoint
of the efficiency criterion consists of the following: choose a vector function $g(x)$
such that (2.64) is minimal when

$$\sum_{a \in A} g(a) \le \alpha, \quad \alpha \ge 0,$$

since the cost, volume, weight, required energy, etc., are positive and, as a rule,
bounded.


Application of an adaptive approach makes it possible to find an algorithm for the 
solution of this problem, a physical realization of the algorithm, and methods for 
using and servicing the system (see for instance Tsypkin [77]). 


In contrast to the problems considered in the preceding sections, here we must deal,
as a rule, with a conditional extremum instead of an unconditional extremum. The
conditions on the extremum have a simple character: in the first problem they
are given as equations, while in the second problem they are given in the form of
inequalities [77].


We will attempt to find an optimal law $f^*(x)$ of the form

$$f^*(x) = c^T \varphi(x), \tag{2.65}$$

where $\varphi(x)$ is an $N$-dimensional vector of linearly independent functions $\varphi_i(x)$
and $c$ is an unknown $N$-dimensional coefficient vector. The representation (2.65)
is similar to orthogonal series estimators where the estimation of $f^*(x)$ on the unit
segment $[0, 1]$ is transformed into the estimation of the coefficients of its Fourier
expansion. Other series estimates, no longer necessarily confined to data belonging
to a finite interval, can be carried out using different orthonormal¹² sequences of
functions. The problem under discussion can now be formulated as follows.


Find an optimal coefficient vector $c^*$ that maximizes

$$I(c) = E\{\Phi(c^T \varphi(x), x)\}, \tag{2.66}$$

under the condition

$$c^T b = \alpha, \tag{2.67}$$

where

$$b = \int_X \chi_{X_0}(x)\, \varphi(x)\, dx.$$


We will attempt to find a conditional extremum by means of Lagrange 
multipliers [77]. 


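The Lagrange function is

$$L(c) = E\{\Phi(c^T \varphi(x), x)\} + \lambda (c^T b - \alpha). \tag{2.68}$$

By setting the gradient of the Lagrange function (2.68) equal to zero, we obtain an
equation providing an optimality condition:

$$\nabla_c L(c) = E\{\nabla_f \Phi(c^T \varphi(x), x)\, \varphi(x)\} + \lambda b = 0 \tag{2.69}$$
$$\nabla_\lambda L(c) = c^T b - \alpha = 0, \tag{2.70}$$

where $\nabla_f \Phi$ denotes the derivative of $\Phi$ with respect to its first argument.
By applying a regular stochastic algorithm to equations (2.69) and (2.70), we obtain
the following algorithm:

$$c(n) = c(n-1) - \gamma(n) \left[ \nabla_f \Phi(c(n-1)^T \varphi(x(n)), x(n))\, \varphi(x(n)) + \lambda(n-1)\, b \right], \tag{2.71}$$
$$\lambda(n) = \lambda(n-1) + \gamma_1(n) \left[ c(n-1)^T b - \alpha \right]. \tag{2.72}$$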
Recall that many optimization algorithms based on stochastic approximation techniques
have been developed in order to solve constrained optimization problems
[32, 42, 43, 81]. The algorithm proposed by Walk is not restrictive compared to the
algorithm proposed by Kushner and Sanvicente.


12 $\varphi_i(x)$ is a sequence of orthonormal functions if

$$\int \varphi_i^2(x)\, dx = 1 \quad \text{and} \quad \int \varphi_i(x)\, \varphi_j(x)\, dx = 0, \quad \forall i \ne j.$$


2.10. Conclusion


The first part of this chapter was devoted to representing various probability distribution
functions, such as the Bernoulli, Gaussian, Cauchy, exponential, Rayleigh,
gamma, Weibull and Poisson distributions. In the second part, we have presented
a set of useful techniques for the estimation of density functions. Two techniques
based on the maximization of the likelihood function were presented, namely, the
expectation maximization method and a technique using sigmoid neural networks.
Numerical simulations illustrated the performance of these algorithms. Finally, a
discussion on model validation using the $\chi^2$ and Kolmogorov approaches ends this
chapter.


Chapter 3


Optimization Techniques 


3.1. Introduction 


In recent decades the efforts of an increasing number of researchers have led to the
development of a wide variety of numerical methods for optimization purposes.
Optimization techniques are attracting increasing interest from researchers with
various backgrounds (statistics, chemical engineering, automatic control, mechanics,
maintenance, etc.) and play an increasingly important role in many areas
(engineering, finance, marketing, insurance, etc.). Many industry analysts believe
that the emphasis in the near future will be on improving efficiency and increasing
profitability of existing plants, rather than on plant expansion.


Optimization techniques stem from diverse approaches, frequently grounded on 
heuristic intuitions, physical or biological considerations. An extensive literature 
on optimization techniques exists [6, 7, 14, 24, 33, 47, 50, 53, 58, 59, 61]. 


In this chapter we shall be concerned with optimization algorithms used in the 
framework of many engineering and economic problems, and with aspects related 
to their implementation. The development of optimization strategies in different 
areas using microcomputers requires algorithms that are both numerically eco- 
nomical and robust. We shall present four techniques for the minimization of 
both unconstrained and constrained problems (cost functions and constraints), mul- 
timodal functions (global optimum) and mixed integer programming problems. The 
algorithms presented in this chapter, namely, stochastic approximation techniques 
(SAT), learning automata (LA), simulated annealing (SA) and genetic algorithms 
(GA), all belong to the class of random search techniques. Random search tech- 
niques lead readily to global search with the advantages of simplicity, efficiency 
and flexibility. 


Let us start with a simple consideration of extrema. A function¹ $f(x)$ attains its
maximum for $x^*$ if there exists $\alpha$ such that

1. $f(x)$ exists in the interval $]x^* - \alpha, x^* + \alpha[$;
2. $\forall x \in ]x^* - \alpha, x^* + \alpha[$, $f(x) \le f(x^*)$.

1 In this chapter, $f(x)$ represents the function to be optimized, not the probability density
function.

This definition shows that a sufficient condition for $f(x)$ to present a maximum is
that we can find $\alpha$ such that

1. $f(x)$ exists and is increasing in the interval $]x^* - \alpha, x^*[$;
2. $f(x)$ exists and is decreasing in the interval $]x^*, x^* + \alpha[$.

A sufficient condition for which these conditions are fulfilled is:

1. $x^* - \alpha < x < x^* \Rightarrow \exists f'(x)$ and $f'(x) > 0$.
2. $x^* < x < x^* + \alpha \Rightarrow \exists f'(x)$ and $f'(x) < 0$.


For the minimum, we can derive analogous conditions. 


The majority of methods for optimization are iterative, in that an infinite sequence
$\{x_k\}$ of estimates of the optimal $x^*$ is generated. Even if it can be proved theoretically
that this sequence will converge in the limit to the required point, a method will be
of use only if convergence occurs with some rapidity.


Let us start this chapter by presenting the basic developments related to stochastic 
approximation techniques in the next section. 


3.2. Stochastic Approximation Techniques 


In this section our main emphasis will be on the use of stochastic approximation 
techniques for solving both unconstrained and constrained optimization problems. 
The recursive estimation procedure, in which after each trial the correction term 
depends only on the realization of this trial and of the previous estimation, is said 
to be a stochastic approximation technique. Pioneering researchers for stochastic 
approximation techniques are Robbins and Monro [52], and Kiefer and Wolfowitz 
[29]. Stochastic approximation techniques are inspired by the gradient method in 
deterministic optimization. They have been widely used to solve many engineering 
problems in the presence of noisy measurements. 


3.2.1. Unconstrained Optimization Using Gradient Measurements 


Let us consider the following estimation problem: Determine the value of the vector
parameter $c$ that minimizes the following function:

$$f(c) = \int_X J(x, c)\, dF(x) = E_x\{J(x, c)\}, \tag{3.1}$$

where $J(x, c)$ is a function that is not known explicitly, $x$ is a stationary random sequence
and $F(x)$ represents the probability distribution function, which is assumed to
be unknown. The vector parameter $c$ that minimizes (3.1) will be denoted by $c^*$.
It corresponds to the solution of the following equation (the necessary condition of
optimality):

$$\nabla_c f(c) = E_x\{\nabla_c J(x, c)\} = 0,$$

where $\nabla_c f(c)$ represents the gradient of the functional $f(c)$ with respect to the
vector parameter $c$.


If the function $J(x, c)$ and the probability distribution function $F(x)$ are assumed to
be unknown, it follows that the gradient $\nabla_c f(c)$ cannot be calculated. The optimal
value $c^*$ of the vector parameter $c$ can be estimated using realizations of the function
$J(x, c)$ as follows:

$$c_n = c_{n-1} - \gamma_n \nabla_c J(x_n, c_{n-1}), \quad \gamma_n \ge 0. \tag{3.2}$$

This is the stochastic approximation technique.


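It has been shown that this estimation algorithm (3.2) converges with probability
one as well as in the mean squares sense, i.e.,

$$\Pr\left\{ \lim_{n \to \infty} (c_n - c^*) = 0 \right\} = 1$$
$$\lim_{n \to \infty} E\left[ \|c_n - c^*\|^2 \right] = 0,$$

if the following hold:

1. Step-size conditions:

$$\sum_{n=1}^{\infty} \gamma_n = \infty, \quad \sum_{n=1}^{\infty} (\gamma_n)^2 < \infty, \quad \text{for a.s. convergence}$$
$$\gamma_n \to 0 \ (n \to \infty), \quad \text{for mean squares convergence.}$$

2. Convexity conditions:

$$\inf_{\varepsilon < \|c - c^*\| < 1/\varepsilon} E_x\left[ (c - c^*)^T \nabla_c J(x, c) \right] > 0, \quad \varepsilon > 0.$$

3. Growth conditions:

$$E_x\left[ \nabla_c J(x, c)^T \nabla_c J(x, c) \right] \le d\,(1 + c^T c), \quad d > 0.$$

The meaning of these requirements as a sufficient criterion for convergence is very
simple (see for instance Tsypkin [57]).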


• Requirement 3 imposes a constraint on the rate of increase in the functional,
and thus on the gradient; the gradient $\nabla_c J(x, c)$ must not grow faster than a linear
function of the norm of $c$.
• Requirement 2 indicates that the iterative algorithm must guarantee the
minimum or the lower branch in the functional. In this case, according to
requirement 1, $\gamma_n$ has to decrease in order to remove the influence of disturbances,
but not so rapidly that a point different from the optimal one is reached.
When noise is not present, $\gamma_n$ can be a constant or a decreasing sequence that
converges to a constant value.
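
As an illustration of the recursion (3.2), consider the following minimal sketch. It assumes the hypothetical criterion $J(x, c) = (c - x)^2 / 2$, whose gradient is $c - x$ and whose minimizer is the mean of $x$, and the step size $\gamma_n = 1/n$, which satisfies the step-size conditions of requirement 1. The function name and the simulated data are illustrative only.

```python
import random

def robbins_monro(grad, samples, c0=0.0):
    """Recursion (3.2) with the step size gamma_n = 1/n."""
    c = c0
    for n, x in enumerate(samples, start=1):
        gamma_n = 1.0 / n
        c = c - gamma_n * grad(x, c)
    return c

# Hypothetical stationary sequence x_n with mean 3 and unit variance.
rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(5000)]

# Gradient of the assumed criterion J(x, c) = (c - x)^2 / 2.
c_hat = robbins_monro(lambda x, c: c - x, samples)
```

With this particular criterion and step size, the recursion reproduces the running arithmetic mean of the observations, which makes the filtering role of the decreasing gain $\gamma_n$ easy to see.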


The analysis of this optimization algorithm will be presented in Chapter 4. 


An acceleration procedure of the algorithm (3.2) has been analyzed by Polyak and
Juditsky [48]. These authors have used the stochastic approximation technique
with averaging: for example, for the case of quadratic optimization, the proposed
procedure looks like

$$c_n = c_{n-1} - \gamma \nabla_c J(x_n, c_{n-1}), \quad \gamma > 0$$
$$\bar{c}_n = \left(1 - \frac{1}{n}\right) \bar{c}_{n-1} + \frac{1}{n}\, c_n. \tag{3.3}$$

This averaging recursion, which corresponds to a filtering procedure, gives the
solution of the optimization problem stated above.
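
The averaging procedure can be sketched as follows, again under the assumption of the hypothetical quadratic criterion $J(x, c) = (c - x)^2 / 2$; the constant gain $\gamma = 0.1$ and the function name are illustrative choices.

```python
import random

def averaged_sa(grad, samples, gamma=0.1, c0=0.0):
    """Constant-gain recursion followed by the running average of the iterates (3.3)."""
    c = c0
    c_bar = 0.0
    for n, x in enumerate(samples, start=1):
        c = c - gamma * grad(x, c)                  # basic recursion with constant gain
        c_bar = (1.0 - 1.0 / n) * c_bar + c / n     # averaging (filtering) step
    return c, c_bar

# Hypothetical stationary sequence x_n with mean 3 and unit variance.
rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(5000)]
c_last, c_bar = averaged_sa(lambda x, c: c - x, samples)
```

The last iterate `c_last` keeps fluctuating around the optimum because the gain does not decrease, whereas the averaged estimate `c_bar` filters these fluctuations out.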


3.2.2. Unconstrained Optimization Using Function Measurements 


Assume that only the measurements of the function to be optimized (but not the 
gradient) are available. In this situation, the so-called random search technique 
should be applied. 


One of several advantages of random search techniques [14] is that these techniques 
do not require a detailed knowledge of the functional relationship between the para- 
meters being optimized and the objective function being minimized, as required by 
gradient-based techniques. They are also simple to implement. Another advantage 
is their general applicability. The body of the SAT algorithm for an unconstrained 
optimization problem is briefly described in what follows. 

Let us consider the following optimization problem

$$c^* = \arg\min_c J(c), \tag{3.4}$$

where $c^* = (c^*(1), \ldots, c^*(k))$ is the parameter vector to be estimated.

Let $c_n(i)$ be the estimate of $c^*(i)$ at the step $n$ ($n$th iteration). The stochastic
algorithm of interest here is then given by

$$c_{n+1}(i) = c_n(i) - a_n h_i(c_n), \tag{3.5}$$

where the gain $\{a_n\}$ (damping factor) satisfies the boundedness conditions [33]:

$$a_n > 0, \quad \lim_{n \to \infty} a_n = 0, \quad \sum_{n=1}^{\infty} a_n = \infty. \tag{3.6}$$

$h_i(c_n)$ represents the estimate of the gradient of the criterion $J(\cdot)$ with respect to
the estimate $c_n(i)$. A simultaneous perturbations form of the estimate of $h_i(\cdot)$ at step $n$
is given by

$$h_i(c_n) = \frac{J(c_n + b_n e(i) \varepsilon_n) - J(c_n - b_n e(i) \varepsilon_n)}{2 b_n}, \tag{3.7}$$

where $\{\varepsilon_n\}$ is a zero-mean independent random sequence, $b_n > 0$ is another step
size, and $e(i)$ is a vector of zeros, except for the $i$th element, which is equal to 1.
In this gradient estimation, two measurements (evaluations) of the criterion $J(\cdot)$
are needed.


The implementation of this optimization algorithm is very easy and needs little 
memory or computational time. 
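
The following sketch shows how such an algorithm can be organized when only noisy function measurements are available. It is a simplified variant given under stated assumptions: the gradient is estimated by central finite differences along each coordinate (a Kiefer-Wolfowitz-type estimate), the noise enters additively in each measurement, and the sequences $a_n = a/n$ and $b_n = n^{-1/4}$, the quadratic test criterion and all names are hypothetical choices.

```python
import random

def finite_difference_sa(measure, c0, n_iter=5000, a=0.5):
    """Recursion (3.5) with a two-measurement finite-difference gradient estimate."""
    c = list(c0)
    for n in range(1, n_iter + 1):
        a_n = a / n          # gain satisfying (3.6)
        b_n = n ** -0.25     # perturbation size
        h = []
        for i in range(len(c)):
            plus = c[:]
            plus[i] += b_n
            minus = c[:]
            minus[i] -= b_n
            # Two noisy evaluations of the criterion per coordinate.
            h.append((measure(plus) - measure(minus)) / (2.0 * b_n))
        c = [ci - a_n * hi for ci, hi in zip(c, h)]
    return c

# Hypothetical criterion: squared distance to a target, observed with additive noise.
rng = random.Random(1)
target = [1.0, -2.0]

def noisy_criterion(c):
    return sum((ci - ti) ** 2 for ci, ti in zip(c, target)) + rng.gauss(0.0, 0.1)

c_hat = finite_difference_sa(noisy_criterion, [0.0, 0.0])
```

Only the current estimate has to be stored, which is consistent with the small memory requirement mentioned above.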


3.2.3. Optimization under Constraints 


A key feature of many practical optimization problems is the presence of con- 
straints. For example, inequality constraints arise commonly in process control 
problems due to physical limitations of plant equipment. The control objective may 
be to minimize some quadratic cost function while satisfying constraints of product 
quality, limitations on resources provided by the environment, etc., and avoidance 
of undesirable operating regimes (flooding in a liquid-liquid extraction column, 
etc.). As an example, it is desirable for navigation, recreation or emergency supply 
purposes to always maintain a water level above some minimal acceptable level 
hmin. In what follows, the constrained optimization (control) problem 


$$\min_c J_0(c) \tag{3.8}$$

will be formulated and solved under the constraints

$$J_i(c) \le 0, \quad (i = 1, \ldots, m). \tag{3.9}$$

$J_0(c)$ corresponds to the optimization criterion. The constraints $J_i(c) \le 0$
($i = 1, \ldots, m$) are usually associated with process physical limitations of valves,
reactor volume, etc.


Let us introduce the Lagrange function [61] defined by

$$L(c, \lambda) = J_0(c) + \sum_{j=1}^{m} \lambda_j J_j(c), \tag{3.10}$$

where $\lambda = [\lambda_1, \ldots, \lambda_m]^T$ is a nonnegative vector of Lagrange multipliers. It is well
known that such multipliers serve several ends expediently: they are prominent in
optimality conditions, key to the design of exact penalty functions, etc.

To solve the optimization problem (3.8)-(3.9), an iterative algorithm based
on stochastic approximation techniques has been proposed by Walk [61]. This
algorithm simultaneously maximizes $L(c, \lambda)$ with respect to $\lambda$ and minimizes it
with respect to $c$. In this algorithm, the components of the gradients of the Lagrange
function with respect to the weights $c$ and the Kuhn-Tucker parameters $\lambda$ are estimated
by finite differences.


Let the estimates $(c_n, \lambda_n)$ be available at time $n$, where $c_n$ are $k$-dimensional
random vectors, $\lambda_n = (\lambda_n(i))$, $i = 1, \ldots, m$, are $m$-dimensional random vectors
with $\lambda_n(i) \ge 0$ ($n \in N$). Let $a_n$, $b_n$ ($n \in N$) be real positive sequences tending to
zero and satisfying

$$\sum_n a_n^2 b_n^{-2} < \infty, \quad \sum_n a_n = \infty, \quad \sum_n a_n b_n < \infty. \tag{3.11}$$

The observation noise (the contamination of function values) is modelled by square
integrable real random variables $V_n^i$ and $V_{n,l}^i$ ($i = 0, \ldots, m$; $l = 1, \ldots, k$; $n \in N$).

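The optimization algorithm² is given by [61]

$$\lambda_{n+1} = \lambda_n + a_n D_\lambda L(c_n, \lambda_n), \tag{3.12}$$

where

$$(D_\lambda L(c_n, \lambda_n))_i = \max\left[ J_i(c_n) - V_n^i,\; -\frac{\lambda_n(i)}{a_n} \right] \quad (i = 1, \ldots, m) \tag{3.13}$$

and

$$c_{n+1} = c_n - a_n D_c L(c_n, \lambda_n), \tag{3.14}$$

where

$$(D_c L(c_n, \lambda_n))_l = (2b_n)^{-1} \left[ J_0(c_n + b_n e_l) - J_0(c_n - b_n e_l) - V_{n,l}^0 \right] + \sum_{i=1}^{m} \lambda_n(i)\, (2b_n)^{-1} \left[ J_i(c_n + b_n e_l) - J_i(c_n - b_n e_l) - V_{n,l}^i \right]. \tag{3.15}$$

$e_l$ is a $k$-dimensional zero vector with 1 as $l$th coordinate ($l = 1, \ldots, k$).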
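
A possible organization of such a primal-dual iteration is sketched below. It is given under several assumptions that are not part of the original text: the dual update keeps the multipliers nonnegative through the term $-\lambda_n(i)/a_n$, the noises are independent Gaussian variables, the sequences are $a_n = 1/n$ and $b_n = n^{-1/4}$, and the test problem (minimize $(c - 2)^2$ subject to $c - 1 \le 0$, whose solution is $c^* = 1$ with multiplier $\lambda^* = 2$) is purely illustrative.

```python
import random

def constrained_sa(J, c0, n_iter=20000, sigma=0.05, seed=0):
    """J = [J0, J1, ..., Jm]; returns the estimates (c, lam) after n_iter steps."""
    rng = random.Random(seed)
    k, m = len(c0), len(J) - 1
    c, lam = list(c0), [0.0] * m
    for n in range(1, n_iter + 1):
        a_n, b_n = 1.0 / n, n ** -0.25

        def diff(Ji, l):
            # Noisy central difference of Ji along the l-th coordinate.
            plus = c[:]
            plus[l] += b_n
            minus = c[:]
            minus[l] -= b_n
            return (Ji(plus) - Ji(minus) - rng.gauss(0.0, sigma)) / (2.0 * b_n)

        d_c = [diff(J[0], l) + sum(lam[i] * diff(J[i + 1], l) for i in range(m))
               for l in range(k)]
        # The max(...) keeps lam[i] + a_n * d_lam[i] nonnegative.
        d_lam = [max(J[i + 1](c) - rng.gauss(0.0, sigma), -lam[i] / a_n)
                 for i in range(m)]
        c = [cl - a_n * g for cl, g in zip(c, d_c)]
        lam = [li + a_n * g for li, g in zip(lam, d_lam)]
    return c, lam

# Hypothetical problem: minimize (c - 2)^2 subject to c - 1 <= 0.
criteria = [lambda c: (c[0] - 2.0) ** 2, lambda c: c[0] - 1.0]
c_hat, lam_hat = constrained_sa(criteria, [0.0])
```

The estimate of $c$ descends on the Lagrange function while the multiplier ascends on it, so that the pair approaches the saddle point.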


2 In the absence of constraints, this algorithm has been considered by Kiefer and Wolfowitz.


In the case of independent and centered noises of observation, 
E vio] «o Vn,l,i (3.16) 


satisfying 


> ra (vio) «oo Vli (3.17) 
YE lo] <œ Wi. (3.18) 


It has been shown [61] that this algorithm converges almost surely to the optimal
solution.

The convergence of this algorithm has been proved by Walk [61], as well as a
central limit theorem with convergence order $n^{-1/3}$, which is also achieved for the
Kiefer-Wolfowitz method [29] to which the algorithm considered reduces if there
are no constraints.

This constrained optimization algorithm has been applied, e.g., by Najim et al. [40]
for solving a predictive control problem, and by Najim and Ikonen [39] for training
the parameters of distributed logic processors under constraints.


3.3. Learning Automata 


The impetus for the study of learning stochastic automata arises because they are 
important models that simulate intelligent behavior of living beings, and useful 
when information of the system concerned is not available or is limited. 


A learning automaton may be considered a system that modifies its control strategy 
on the basis of its experience, in order to reach good control (optimization) per- 
formances in spite of unpredictable changes in the environment where it operates. 
In other words, learning automata should, by collecting and processing current 
information regarding the environment, be capable of changing their structure and 
parameters as time passes to achieve the desired goal or the optimal performance 
(in some sense). 


3.3.1. Learning Automaton 


Let us start by giving a formal description of a learning automaton. An automaton 
is an adaptive discrete machine described by: 


$$\{\Xi, U, R, \{\xi_n\}, \{u_n\}, \{p_n\}, T\},$$

where:

1. $\Xi$ is the automaton input bounded set.

2. $U$ denotes the set $\{u(1), u(2), \ldots, u(N)\}$ of the actions of the automaton.

3. $R = (\Omega, \mathcal{F}, \Pr)$ is a probability space.

4. $\{\xi_n\}$ is a sequence of automaton inputs (environment responses, $\xi_n \in \Xi$)
provided by the environment in a binary (P-model environment) or continuous
(S-model environment) form.

5. $\{u_n\}$ is a sequence of automaton outputs (actions).

6. $p_n = [p_n(1), p_n(2), \ldots, p_n(N)]^T$ is the probability distribution at time $n$:

$$p_n(i) = \Pr\{\omega : u_n = u(i) \mid \mathcal{F}_{n-1}\} \quad \text{and} \quad \sum_{i=1}^{N} p_n(i) = 1, \quad \forall n,$$

where $\mathcal{F}_n = \sigma(\xi_1, u_1, p_1; \ldots; \xi_n, u_n, p_n)$ is the $\sigma$-algebra generated by the
corresponding events ($\mathcal{F}_n \subset \mathcal{F}$).

7. $c_n = [c_n(1), c_n(2), \ldots, c_n(N)]^T$ is the conditional mathematical expectation
vector of the environment responses (at time $n$), i.e., $E\{\xi_n \mid \mathcal{F}_{n-1};\, u_n = u(i)\} = c_n(i)$.
In the stationary case, $c_n(i) = c(i)$.

8. $T$ represents the reinforcement scheme (updating scheme), which changes the
probability vector $p_n$ to $p_{n+1}$:

$$p_{n+1} = p_n + \gamma_n T_n\left(p_n;\, \{\xi_t\}_{t=1,\ldots,n};\, \{u_t\}_{t=1,\ldots,n}\right) \tag{3.19}$$
$$p_1(i) > 0 \quad \forall i = 1, \ldots, N,$$

where $\gamma_n$ is a scalar correction factor and the vector $T_n(\cdot) = [T_n^1(\cdot), \ldots, T_n^N(\cdot)]^T$
satisfies the following conditions (for preserving the probability measure):

$$\sum_{i=1}^{N} T_n^i(\cdot) = 0, \quad \forall n \tag{3.20}$$

$$p_n(i) + \gamma_n T_n^i(\cdot) \in [0, 1] \quad \forall n,\ \forall i = 1, \ldots, N. \tag{3.21}$$

Let us consider a very simple example. 


Example 53 (Pigeon in a cage) A pigeon was placed in a cage (see Figure 3.1)
with a black and a white disk mounted on the walls. If, by chance, the pigeon pecks
at the black disk, it receives a certain amount of grain. If it pecks elsewhere (white
disk), it receives no reward or some small electrical impulse (penalty).

• Environment: Cage + feeding system.
• Automaton: Pigeon.
• Actions: Pecking at black disk $u(1)$, pecking elsewhere $u(2)$.
• Environment response: $\xi_n = 0$ (the pigeon is provided with grains), $\xi_n = 1$ (the
pigeon is not provided with grains).
• Probabilities: $p(1)$ is the probability for taking the action $u(1)$, and $p(2)$ is the
probability for selecting the action $u(2)$.

Figure 3.1 The pigeon's cage

Figure 3.2 Learning system

Based on the probabilities $p(1)$ and $p(2)$, the pigeon selects randomly one action.
After a learning period, the pigeon will learn to peck at the black disk, i.e.,
$p_n(1) \to 1$, where $n$ represents the time index.


A learning system consists of an automaton connected in a feedback loop to the 
environment where it operates (see Figure 3.2). Learning deals with the ability 
of systems to improve their response based on past experience. The environ- 
ment establishes the relation between the actions of the automaton and the signals 
received at its input. It includes all external influences. The environment produces 
a random response whose statistics depend on the current stimulus or input. 


The reinforcement scheme is the heart of the learning automaton. A number of
different reinforcement schemes have been proposed in the literature. A reinforcement
scheme can be linear or nonlinear; discrete or continuous; reward-penalty
or reward-inaction, etc. Sometimes it is advantageous to update $p_n$ according to
different schemes depending on the intervals in which the value of $p_n$ lies. Notice
also that there exist not only single-automaton learning systems, but different
structures of learning automata (hierarchical structures, automata with a changing
number of actions, cooperating or noninteracting teams of learning automata, etc.).


The loss function $\Phi_n$ associated with the learning automaton is given by

$$\Phi_n = \frac{1}{n} \sum_{t=1}^{n} \xi_t. \tag{3.22}$$

It is a useful quantity for judging the behavior of a learning automaton.
The learning algorithm operates as follows: 


• Step 1. Choice of one action. A simple technique to select one action among
$N$ possibilities is based on the generation of a uniformly $U(0, 1)$ distributed
random variable $\xi$. The algorithm chooses the action $u(i)$ such that $i$ is the
least index verifying the following constraint:

$$\sum_{j=1}^{i} p_n(j) \ge \xi.$$

• Step 2. Calculation of the (averaged, normalized) value of the realization of
the function to be optimized.

• Step 3. Use of this response and a reinforcement scheme to adjust the
probability vector $p_n$.

• Step 4. Return to Step 1.
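
Step 1 can be sketched as follows. The function name is hypothetical, and actions are indexed from 0 rather than from 1, as is usual in Python; the fallback on the last action is an added safeguard against rounding errors in the cumulative sum.

```python
import random
from itertools import accumulate

def select_action(p, xi):
    """Return the least index i such that p[0] + ... + p[i] >= xi."""
    for i, cumulative in enumerate(accumulate(p)):
        if cumulative >= xi:
            return i
    return len(p) - 1   # safeguard if the probabilities sum to slightly less than 1

# Hypothetical probability vector over N = 3 actions.
p_n = [0.2, 0.5, 0.3]
rng = random.Random(0)
action = select_action(p_n, rng.random())
```

Each action is thus drawn with the probability currently assigned to it, since the uniform variable falls in an interval of length $p_n(i)$.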


Let us, using a simple example, have a look at how to formulate the problem related 
to dryer control using learning automata. 


Example 54 (Control of a dryer) Let us consider a dryer. Our objective is to find
the value of the control variable (energy) in order that the moisture content will be
equal to a given desired value $h_d$. It is well known that the evolution of the moisture
content as a function of the energy has an exponential profile that becomes very
flat for small values of the moisture content (see Figure 3.3). Let us assume that the
energy belongs to the interval $[E_{\min}, E_{\max}]$. This interval will be discretized into
$N$ sub-intervals. The center of each sub-interval $i$ will be considered as control
variable, and will be denoted by $u(i)$. At each time $n$, the learning automaton selects
an action $u_n \in \{u(1), u(2), \ldots, u(N)\}$.

A very simple control strategy for the dryer system can be described as follows:

$h > h_d$ and $u_n > u_{n-1}$, $\xi_n = 0$ (reward)
$h > h_d$ and $u_n < u_{n-1}$, $\xi_n = 1$ (penalty or inaction)
$h < h_d$ and $u_n > u_{n-1}$, $\xi_n = 1$ (penalty or inaction)
$h < h_d$ and $u_n < u_{n-1}$, $\xi_n = 0$ (reward).

Figure 3.3 Moisture content as a function of the energy


With each action u(i) is associated a probability p(i). If an action leads to a reward, 
its probability is increased and the other components of the probability vector are 
decreased in order to ensure the probability measure. 


This example shows how learning automata can be implemented for control 
purposes on the basis of "if-then" rules. 
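
The four rules of the dryer example can be coded directly. In this sketch the function name is hypothetical, and the treatment of ties ($h = h_d$ or $u_n = u_{n-1}$), which the rules leave unspecified, is an arbitrary assumption: a tie is handled like the "less than" case.

```python
def dryer_response(h, h_d, u_n, u_prev):
    """Environment response xi_n: 0 = reward, 1 = penalty (or inaction)."""
    too_wet = h > h_d           # moisture content above the desired value
    increased = u_n > u_prev    # the automaton has increased the energy
    # Reward when the energy moves in the direction that reduces the error.
    return 0 if too_wet == increased else 1

# Moisture too high and energy increased: the action is rewarded.
xi = dryer_response(h=0.30, h_d=0.20, u_n=5.0, u_prev=4.0)
```

The response returned by this function is then passed to a reinforcement scheme, which adjusts the probability vector.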


The use of learning automata for solving both unconstrained and constrained 
optimization problems will be presented in the following subsections. 


3.3.2. Unconstrained Optimization 


Let us consider the function $f(x)$ which is a real-valued function of a vector parameter
$x \in X$ where $X$ is a compact in $R^M$. We would like to find the value $x = x^*$
that minimizes this function, i.e.,

$$x^* = \arg\min_{x \in X \subset R^M} f(x). \tag{3.23}$$

Notice that there are almost no conditions concerning the function $f(x)$
(continuity, unimodality, differentiability, convexity, etc.) to be optimized. We are
concerned with a global optimization problem of multimodal and nondifferentiable
functions. Let $y_n$ be the observation of the function $f(x)$ at the point $x_n \in X$, i.e.,

$$y_n = f(x_n) + w_n, \tag{3.24}$$

where $w_n$ is the observation noise (disturbance) at time $n$. We shall transform this
optimization problem on the continuous set $X$ into an optimization problem on
a discrete set, which will be defined in what follows.


Quantization Consider now a quantization $\{X_i\}$ of the admissible compact
region $X \subset R^M$:

$$X_i \subset X \tag{3.25}$$
$$X_j \cap X_i = \emptyset, \quad i \ne j, \quad i, j = 1, \ldots, N \tag{3.26}$$
$$\bigcup_{i=1}^{N} X_i = X \subset R^M, \tag{3.27}$$

where $x_n \in \{x(1), x(2), \ldots, x(N)\} := \tilde{X}$, $x(i) \in X_i$. Here the points $x(i)$ are some
fixed points (for example, the center points of the corresponding subsets $X_i$). $w_n =
w_n(\omega)$ is a random variable given on a probability space $(\Omega, \mathcal{F}, \Pr)$ ($\omega \in \Omega$, a
space of elementary events), characterizing the observation noise associated with
the point $x_n$.

The next lemma (see Chapter 3, Lemma 1 in [51]) states the connection between
the original stochastic optimization problem on the given compact $X$, formulated
above, and the corresponding stochastic optimization problem on the finite set $\tilde{X}$.

Lemma 5 Let us assume that on each subset $X_i$ the optimized function $f(x)$ is
Lipschitzian, i.e., there exists a constant $L_i$ such that for any $x'$ and $x'' \in X_i$ the
following inequality is fulfilled

$$|f(x') - f(x'')| \le L_i \|x' - x''\|.$$

Then, for any $\varepsilon > 0$ there exists a quantization $\{X_i\}$ ($i = 1, \ldots, N$) of the admissible
compact region $X \subset R^M$ such that the minimal values of the optimized function
$f(x)$ on the compact $X$ and on the discrete set $\tilde{X}$ differ by no more than $\varepsilon$ (accuracy).
Moreover, for this purpose, the quantization number N must satisfy the following 
inequality: 


max Li, (3.28) 
where $D := \sup_{x, y \in X} \|x - y\|$ is the diameter of the given compact $X$.


We shall consider three reinforcement schemes for solving unconstrained
optimization problems.

Bush-Mosteller Scheme

In this subsection, we will focus on the Bush-Mosteller scheme. It is a linear
reward-penalty reinforcement scheme. The Bush-Mosteller scheme [9] is
commonly used for the adaptation of the probability distribution $p_n$.

For the P-model environment, $\xi_n \in \{0, 1\}$, we can write the following algorithm:
If at instant $n$ the action $u(k)$ is selected, and the environment response is $\xi_n = 0$
(rewarding), the probability update is given by

$$p_{n+1}(k) = p_n(k) + \beta (1 - p_n(k))$$
$$p_{n+1}(j) = p_n(j) - \beta p_n(j), \quad j \ne k.$$

If $\xi_n = 1$ (action should be penalized), then

$$p_{n+1}(k) = p_n(k) - \beta p_n(k)$$
$$p_{n+1}(j) = p_n(j) - \beta p_n(j) + \frac{\beta}{N - 1}, \quad j \ne k,$$

where $\beta$ is the learning coefficient, $\beta \in [0, 1]$.


Observe that when an action leads to a reward (penalty), its associated probability
is increased (decreased) and the probabilities associated with the other actions are
decreased (increased) in order to preserve the probability measure. Notice that in
the original version of the Bush-Mosteller scheme, the correction factor (denoted
here by $\gamma_n$) is constant ($\beta = \text{const}$). Observe also that the environment response need
not be restricted to binary values (P-model), but can extend to the unit interval
(S-model), i.e., $\xi_n \in [0, 1]$.


For deriving the asymptotic properties of recursive algorithms based on learning 
automata, it is more convenient to rewrite the reinforcement schemes in a vector 
form. The Bush-Mosteller scheme [9, 38] can be rewritten as 


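$$p_{n+1} = p_n + \gamma_n \left[ e(u_n) - p_n + \xi_n \frac{e^N - N e(u_n)}{N - 1} \right] \tag{3.29}$$

where

$$\gamma_n \in [0, 1], \quad \xi_n \in [0, 1],$$
$$e(u_n) = (\underbrace{0, \ldots, 0, 1}_{i}, 0, \ldots, 0)^T, \quad u_n = u(i),$$
$$e^N = (1, \ldots, 1)^T \in \mathbb{R}^N.$$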
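The vector form (3.29) translates directly into a few lines of array code. The following is a minimal sketch, assuming the update reads $p_{n+1} = p_n + \gamma_n [e(u_n) - p_n + \xi_n (e^N - N e(u_n))/(N-1)]$; the function name `bush_mosteller_update` and the example numbers are illustrative and do not come from the text.

```python
import numpy as np


def bush_mosteller_update(p, i, xi, gamma):
    """One step of the vector-form update (3.29), as assumed above.

    p     : current action probabilities (length N, sums to 1)
    i     : index of the action selected at this step
    xi    : environment response in [0, 1] (0 = reward, 1 = penalty)
    gamma : correction factor in [0, 1]
    """
    p = np.asarray(p, dtype=float)
    n_actions = p.size
    e_u = np.zeros(n_actions)
    e_u[i] = 1.0                      # unit vector e(u_n)
    e_all = np.ones(n_actions)        # vector e^N of ones
    correction = e_u - p + xi * (e_all - n_actions * e_u) / (n_actions - 1)
    return p + gamma * correction


p0 = np.array([0.5, 0.3, 0.2])
rewarded = bush_mosteller_update(p0, i=0, xi=0.0, gamma=0.1)
penalized = bush_mosteller_update(p0, i=0, xi=1.0, gamma=0.1)
print(rewarded, penalized)
```

In both cases the components still sum to one, which is the property the scheme is designed to preserve.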


Shapiro-Narendra Scheme 
The Shapiro-Narendra scheme is a linear reward-inaction reinforcement scheme. 
The Shapiro-Narendra rules are given by the following: For the state $i$ selected at
instant $n$:

$$p_{n+1}(i) = p_n(i) + \gamma\,(1 - \xi_n)(1 - p_n(i)).$$

For all other states $j$:

$$p_{n+1}(j) = p_n(j) - \gamma\,(1 - \xi_n)\, p_n(j),$$

where $\gamma$ is the learning rate. It is clear that whenever a penalty is received ($\xi_n = 1$),
the probabilities remain unchanged (inaction). As with the Bush-Mosteller scheme,
the environment responses in the Shapiro-Narendra scheme need not be restricted
to binary values but can extend to the unit interval.
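As an illustration of the reward-inaction behavior, here is a short sketch of one Shapiro-Narendra step. The function name `shapiro_narendra_update` and the numerical values are assumptions made for this example only.

```python
import numpy as np


def shapiro_narendra_update(p, i, xi, gamma):
    """One linear reward-inaction step.

    p     : current action probabilities (sums to 1)
    i     : index of the selected action
    xi    : environment response in [0, 1] (1 = full penalty)
    gamma : learning rate
    """
    p = np.asarray(p, dtype=float)
    step = gamma * (1.0 - xi)
    new_p = p - step * p                   # all other actions shrink
    new_p[i] = p[i] + step * (1.0 - p[i])  # selected action grows
    return new_p


p0 = np.array([0.5, 0.3, 0.2])
after_reward = shapiro_narendra_update(p0, i=1, xi=0.0, gamma=0.1)
after_penalty = shapiro_narendra_update(p0, i=1, xi=1.0, gamma=0.1)
print(after_reward, after_penalty)
```

With a full penalty (`xi = 1`) the step size is zero, so the probability vector is returned unchanged.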


McMurtry-Fu Scheme

The McMurtry-Fu scheme operates as follows: At each time, based on the probability
distribution $p_n$, an automaton selects an action $u_n = u(i)$ (i.e., a point $x(i)$) randomly,
and a realization of the function to be optimized is carried out: $y_n = y(x(i))$.


Let us consider the following averaging technique

$$\bar z_{n+1}(i) = \alpha\, \bar z_n(i) + (1 - \alpha)\, z_{n+1}(i), \tag{3.30}$$
$$\bar z_{n+1}(j) = \bar z_n(j), \quad j \neq i, \tag{3.31}$$

$0 < \alpha < 1$, where $z_{n+1}(i)$ represents the realization of the following cost function:

$$z_{n+1}(i) = \left( \frac{H}{y_n} \right)^{\gamma}, \tag{3.32}$$

where $\gamma$ = const > 0, $H$ = const > 0, and $y_n$ corresponds to the response of the
environment at instant $n$.


Remark 39 (Exponential weighting) Equations (3.30)-(3.31) show that the
$(r + s)$th measurement is weighted $\alpha^{-s}$ times heavier than the $r$th measurement.
When a proper choice of $\alpha$ is made, an appropriate emphasis may be placed on
recent and past observations. If $\alpha = 1 - 1/(n + 1)$, this model simply represents
the standard averaging process.
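The weighting stated in the remark can be checked by unrolling the recursion (3.30) for a fixed action. The sketch below assumes, for simplicity, that this action is selected at every step and writes $z_r$ for its $r$th realization and $\bar z_n$ for the running average (an indexing convention introduced here only for illustration):

```latex
\bar z_n = \alpha^{n}\,\bar z_0 + (1-\alpha)\sum_{r=1}^{n} \alpha^{\,n-r} z_r ,
\qquad
\frac{\text{weight of } z_{r+s}}{\text{weight of } z_r}
  = \frac{\alpha^{\,n-r-s}}{\alpha^{\,n-r}} = \alpha^{-s} > 1 .
```

With the time-varying choice $\alpha = 1 - 1/(n+1) = n/(n+1)$, the recursion becomes $\bar z_{n+1} = (n\,\bar z_n + z_{n+1})/(n+1)$, which is the running arithmetic mean.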


This averaging procedure will be used as a reinforcement scheme. The probabilities 
associated with the actions of the learning automaton are assigned as follows: 


$$p_{n+1}(i) = \frac{\bar z_{n+1}(i)}{Z_{n+1}}, \quad i = 1, \ldots, N, \tag{3.33}$$

where

$$Z_{n+1} = \sum_{i=1}^{N} \bar z_{n+1}(i). \tag{3.34}$$

Remark 40 (Inverse proportionality) Since $z_n(\cdot)$ is inversely proportional to
$f(\cdot)$ (see equation (3.32)), lower values of $f(\cdot)$ result in larger $z_n(\cdot)$, and therefore
in a higher probability $p_n(\cdot)$.
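The averaging and probability assignment of (3.30)-(3.34) can be sketched as follows. This is only an illustration under stated assumptions: the cost realization is taken as $(H/y_n)^{\gamma}$ with a strictly positive response $y_n$, the exponent is passed as `power`, and all names and numbers are hypothetical.

```python
import numpy as np


def mcmurtry_fu_step(z_bar, i, y, alpha=0.9, H=1.0, power=1.0):
    """Update the averaged costs and return the new action probabilities.

    z_bar : averaged cost values, one per action (all positive)
    i     : index of the action that was just tried
    y     : environment response for that action (assumed > 0)
    """
    z_new = (H / y) ** power                        # realization, as in (3.32)
    z_bar = np.asarray(z_bar, dtype=float).copy()
    z_bar[i] = alpha * z_bar[i] + (1.0 - alpha) * z_new   # (3.30)
    # Components j != i are left unchanged, as in (3.31).
    p = z_bar / z_bar.sum()                         # (3.33)-(3.34)
    return z_bar, p


z_bar, p = mcmurtry_fu_step([1.0, 1.0], i=0, y=0.5, alpha=0.5)
print(z_bar, p)
```

A small response `y` for action 0 yields a large `z_new`, so that action's probability rises, which is the inverse proportionality noted in Remark 40.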


One further comment should be made. 


Remark 41 (Relation to Bush-Mosteller) Notice that when the correction factor 
tends to zero, the well-known Bush—Mosteller linear reward-penalty scheme leads 
to a probability distribution proportional to the average losses (see Section 3, 
Chapter 3, in Najim and Poznyak [38]). 


The function to be optimized is, in general, not positive. As a consequence, the
positivity of the probabilities $p_{n+1}(i)$ is not guaranteed. A projection
algorithm ensuring the probability measure is discussed later in this section.


Normalization 

When we deal with optimization problems, the realization of the criterion to be
optimized corresponds to the environment response. In general, this response does not
belong to the unit segment (notice that belonging to the unit segment is commonly
required by the reinforcement schemes, in order to preserve the probability
measure). To avoid this problem, we will consider the following general normalization
procedure, which has been developed by Poznyak and Najim [51]. At least as
importantly, this normalization scheme performs averaging of the environment
responses in time, i.e., it provides estimates of the expected environment responses
for each action of the automaton.


In many applications, this very simple normalization procedure has been revealed 
to be very efficient. The schematic diagram of a learning automaton operating in a 
random environment with normalized responses is given in Figure 3.4. 


Figure 3.4 Learning system with normalized environment response (blocks: normalization, learning automaton)
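The automaton input (environment response) is then normalized as follows:

$$\hat\xi_n = \frac{\left[ s_n(i) - \min_k s_{n-1}(k) \right]_+}{\max_k \left[ s_n(k) - \min_j s_{n-1}(j) \right]_+ + 1} \in [0, 1], \quad u_n = u(i), \tag{3.35}$$

where

$$s_n(i) = \frac{\sum_{t=1}^{n} \xi_t\, \chi(u_t = u(i))}{\sum_{t=1}^{n} \chi(u_t = u(i))}, \quad i = 1, \ldots, N, \tag{3.36}$$

and

$$[x]_+ = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \tag{3.37}$$

$$\chi(u_n = u(i)) := \begin{cases} 1 & \text{if } u_n = u(i) \\ 0 & \text{if } u_n \neq u(i). \end{cases}$$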


It has been shown that the normalized environment response $\hat\xi_n$ belongs to the unit
segment [38].
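A small sketch may help to see how the normalization keeps the response in the unit segment. It assumes the form $\hat\xi_n = [s_n(i) - \min_k s_{n-1}(k)]_+ / (\max_k [s_n(k) - \min_j s_{n-1}(j)]_+ + 1)$ with $s_n(i)$ the sample mean of the responses received for action $i$; treating untried actions as having mean zero is an additional assumption of this example, and the class name `ResponseNormalizer` is hypothetical.

```python
import numpy as np


class ResponseNormalizer:
    """Running per-action sample means and normalized responses."""

    def __init__(self, n_actions):
        self.sums = np.zeros(n_actions)
        self.counts = np.zeros(n_actions)

    def _means(self):
        # Sample means s(i); untried actions are given the value 0 (assumption).
        return np.divide(self.sums, self.counts,
                         out=np.zeros_like(self.sums),
                         where=self.counts > 0)

    def normalize(self, i, xi):
        """Record response xi for action i and return its normalized value."""
        prev_min = self._means().min()            # min_k s_{n-1}(k)
        self.sums[i] += xi
        self.counts[i] += 1
        shifted = np.maximum(self._means() - prev_min, 0.0)   # [.]_+
        return shifted[i] / (shifted.max() + 1.0)


norm = ResponseNormalizer(n_actions=2)
r1 = norm.normalize(0, 5.0)
r2 = norm.normalize(1, 2.0)
print(r1, r2)
```

Because the numerator never exceeds the maximum appearing in the denominator, the returned value always lies in $[0, 1)$, whatever the scale of the raw responses.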


Projection 

The positivity of the probabilities $p_{n+1}(i)$ is not necessarily
guaranteed by all reinforcement schemes. To avoid this problem, a projection
operator $\Pi_S$ onto the simplex

$$S = \left\{ p_n : \sum_{i=1}^{N} p_n(i) = 1, \; p_n(i) \geq 0, \; i = 1, \ldots, N \right\} \tag{3.38}$$


can be used. The algorithm to be discussed is particularly elegant in that it provides
a recursion of the probability vector that preserves the probability measure.
Notice that in many problems, such as parameter estimation in process modeling
(stability of the estimated Hammerstein and Wiener models, etc.) [24] and optimization
under constraints (Lagrange and penalty function approaches), the implementation
of a projection procedure is necessary in order that the estimate belongs to
the constraint set. Observe that the projection is not easy to compute unless the
constraints are linear [33].


Remark 42 In stochastic optimization problems with decision-independent
probability, i.e.,

$$F(x) = \int h(x, \omega)\, P(d\omega) \to \min, \quad x \in S \subset \mathbb{R}^l,$$

if a stochastic gradient of $h(x, \omega)$ and $F(\cdot)$ in the sense

$$E\{\nabla_x h(x, \omega)\} = \frac{\partial F(x)}{\partial x}$$

exists, then this problem can be solved by the constrained Robbins-Monro
algorithm using a procedure that ensures the projection onto $S$, which is assumed
to be convex and closed [36]. The projection onto $S$ can be easily calculated (see
for instance Chapter 6 in [15]):

$$\pi_S(z) = \arg\min\{\|z - x\| : x \in S\}.$$

If $S$ is a hypercube $a \leq x \leq b$, then

$$\pi_S(z) = \max\{a, \min\{z, b\}\},$$

where the maximum and minimum are taken componentwise.
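For the hypercube case the projection reduces to componentwise clipping. The snippet below is a minimal sketch of that special case; the function name `project_hypercube` and the sample vector are illustrative assumptions.

```python
import numpy as np


def project_hypercube(z, a, b):
    """Euclidean projection of z onto the box a <= x <= b (componentwise)."""
    return np.maximum(a, np.minimum(z, b))


z = np.array([-1.0, 0.5, 3.0])
projected = project_hypercube(z, 0.0, 1.0)
print(projected)
```

Components already inside the bounds are left untouched; the others are moved to the nearest bound.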


The algorithm related to the projection procedure is presented at the end of this
section. Its Matlab mechanization is given in Appendix B. An extended version of
this random search algorithm has been given by Najim et al. [41].


We have now presented the main components required for the application of 
learning automata based optimization. Simulation examples on optimization under 
constraints are given after the next subsection. 


3.3.3. Optimization under Constraints 


The learning automata operating in a random environment can be used to solve 
constrained optimization problems. This can be a difficult task, however, because 
the objective related to the optimization problem must be rephrased in the form of 
rules involving the Boolean operators. This leads to the construction of some kind 
of rule-based (if premise then consequent) expert system that gathers up a deep 
description (a fundamental kind of “dissection,” “husking”) of the optimization 
problem to be solved. 


Notice that theoretical analyses (convergence and estimation of the convergence
rate) of learning automata operating in environments with binary and continuous
responses [38] exist, which is not usually the case with rule-based techniques.
The aim of this subsection is to present a constrained optimization algorithm based 
on learning automata in which the criterion and the constraints are taken directly 
into account. 


Projectional Algorithm 
Let us consider an automaton with N actions. The optimization problem to be 
solved is formulated as follows: 


$$\sum_{i=1}^{N} E\{J_0 \mid i\}\, p(i) \to \inf_{p \in S^N}, \tag{3.39}$$

subject to $M$ constraints

$$\sum_{i=1}^{N} E\{J_j \mid i\}\, p(i) \leq 0, \quad j = 1, \ldots, M, \tag{3.40}$$

where $J_0$ represents the realization of the cost function to be minimized and $S^N$ is
the $N$-dimensional simplex, i.e., $S^N = \{p \in \mathbb{R}^N : p(i) \geq 0, \sum_{i=1}^{N} p(i) = 1\}$. The
term $E\{J_j \mid i\}$ is the conditional expectation subject to the condition that one
of the $N$ actions (events), namely, the $i$th event, has occurred (the $i$th action has
been selected). All values $E\{J_j \mid i\}$ ($j = 0, 1, \ldots, M$) are assumed to be constant,
which corresponds to the stationary situation.


To solve the optimization problem (3.39)-(3.40), we shall introduce the penalty
function:

$$L(p, \lambda) = \sum_{i=1}^{N} E\{J_0 \mid i\}\, p(i) + \sum_{j=1}^{M} \frac{\lambda(j)}{2} \left( \left[ \sum_{i=1}^{N} E\{J_j \mid i\}\, p(i) \right]^+ \right)^2, \tag{3.41}$$

where the operator $[\cdot]^+$ is defined by

$$[x]^+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$

and $\lambda(j)$ ($j = 1, \ldots, M$) represent the penalty multipliers. The optimization
problem defined by (3.39) and (3.40) is equivalent to the minimization of the
penalty function given by (3.41), i.e.,

$$\inf_{p \in S^N} L(p, \lambda) \tag{3.42}$$

for large enough penalty multipliers $\lambda$.


To solve this stochastic programming problem, we shall present an appropriate 
reinforcement scheme (probability updating procedure), proposed and analyzed by 
Poznyak [49]. The asymptotic analysis (convergence with probability 1) has been 
derived on the basis of the Lyapunov approach and martingale theory. 


Let $u(i)$ be the action selected by the automaton at time $n$. Based on the environment
response, the probability distribution is adjusted using the following recursive
procedure:

$$p_{n+1}(i) = \Pi_S\{p_n(i) - \gamma(n)\, g(n, d_{n-1}, p_n / i)\}$$
$$p_{n+1}(j) = \Pi_S\{p_n(j) + \gamma(n)\, g(n, d_{n-1}, p_n / i)/(N - 1)\}, \quad j \neq i$$

$$g(n, d_{n-1}, p_n / i) = J_0(n/i) + \sum_{j=1}^{M} \lambda(j) \left[ \sum_{s=1}^{N} d_{n-1}(j, s)\, p_n(s) \right]^+ J_j(n/i)$$

$$d_n(j, i) = \frac{1}{\nu_n(i)} \sum_{t=1}^{n} \chi(u(t) = u(i))\, J_j(t/i) = d_{n-1}(j, i) - \frac{1}{\nu_n(i)} \left[ d_{n-1}(j, i) - J_j(n/i) \right]$$

$$d_n(j, s) = d_{n-1}(j, s), \quad s \neq i$$

$$\nu_n(i) = \sum_{t=1}^{n} \chi(u(t) = u(i)) = \nu_{n-1}(i) + \chi(u(n) = u(i)).$$


$\Pi_S$ represents the projection operator onto the simplex $S^N$ in order to preserve the
probability measure, and $J_j(n/i)$ ($j = 0, 1, \ldots, M$) are the realizations at time $n$
of the cost function ($j = 0$) when $u(n) = u(i)$, and of the corresponding constraints
($j = 1, \ldots, M$).


Observe that the first two equations correspond to the Bush-Mosteller reinforcement
scheme. $g(n, d_{n-1}, p_n / i)$ is the realization of the gradient of the penalty function
with respect to the vector $p$, and $d_n(j, i)$ represents the estimate of the constraint
functions $E\{J_j \mid i\}$.


This optimization algorithm has been successfully implemented by Najim [37] for
the constrained control of a rotary phosphate dryer, which constitutes an important
source of energy consumption.


Projection Procedure 

The projection procedure plays an important role in many algorithms based on 
learning automata. Before presenting the algorithm performing the projection of 
a given vector p onto the simplex S, let us consider the following simple example. 


Example 55 (Projection) Suppose that at time n, for an automaton with two 
actions, we obtain the following probability vector: 


$$p_n(1) = 0.7, \quad p_n(2) = 0.9,$$

which will be associated with the point A. It is evident that this vector does not
belong to the simplex ($p_n(1) + p_n(2) = 1.6$). The projection consists of finding the
point B, belonging to the segment CD represented in Figure 3.5, that is closest to A.
The coordinates of B are (0.4, 0.6).

Figure 3.5 Geometrical interpretation of the projection procedure


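Now we shall present the mathematical tools related to the projection onto a simplex.
The projection operator $\Pi_S$ onto the simplex $S$,

$$S = \left\{ p_n : \sum_{i=1}^{N} p_n(i) = 1, \; p_n(i) \geq 0, \; i = 1, \ldots, N \right\},$$

is a contraction defined as

$$\Pi_S(z) = \begin{cases} z & \text{if } z \in S \\ z^* & \text{if } z \notin S, \end{cases} \qquad \|z - z^*\| = \min_{y \in S} \|z - y\|.$$

It is evident that the following property holds:

$$\|p_n - \Pi_S(q)\| \leq \|p_n - q\|.$$

The simplex consists of

$C_N^1 = N$ vertices $\Gamma_1$,
$C_N^2 = \tfrac{1}{2} N(N - 1)$ edges (two-dimensional faces) $\Gamma_2$,
...
$C_N^{N-1} = N$ $(N - 1)$-dimensional faces $\Gamma_{N-1}$.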
The point for which the coordinates are all equal to zero (except the $j$th, which
equals 1) is the vertex $\Gamma_1^j$. The face $\Gamma_m$ ($m \geq 2$) is the subset


rA = (Pr: X € Dm, pG) = 0) 
of one of the hyperplanes $D_m$, defined as follows:
N 
D,—p Y pi) =1, Pa) 20, i=1,..,N. 
i=] 
The projection of $p_n$ is defined as follows:

$$\Pi_S(p_n) = p_n^* : \quad \|p_n - p_n^*\| = \min_{y \in S} \|p_n - y\|. \tag{3.43}$$

It is obvious that $p_n^* \in \Gamma_k$ for a certain $k$. Note that finding $p_n^* = \Pi_S(p_n)$ is
equivalent to finding the point on the simplex $S$ that is closest to the projection
$p_n(D_N)$ of the point $p_n$ onto $D_N$.
From the definition (3.43), and because the two terms below are orthogonal, it follows that

$$\min_{y \in S} \|y - p_n\|^2 = \min_{y \in S} \|(y - p_n(D_N)) + (p_n(D_N) - p_n)\|^2 = \|p_n(D_N) - p_n\|^2 + \min_{y \in S} \|y - p_n(D_N)\|^2.$$


The following lemma [43, 51] is the basis for the development of the recursive 
algorithm for implementing the projection operator. 


Lemma 6 The face $\Gamma_{N-1}$ closest to the point $p_n(D_N)$ has an orthogonal vector

$$a_{N-1} = (\underbrace{1, 1, \ldots, 1, 0}_{j}, \ldots, 1). \tag{3.44}$$

The index $j$ corresponds to the smallest component of the vector (point) $p_n(D_N)$, i.e.,

$$j = \left\{ i : (p_n(D_N))_i = \min_k (p_n(D_N))_k \right\}. \tag{3.45}$$


The projection procedure is accomplished in a number of steps not exceeding N. 


The projection onto the simplex $S_\varepsilon$ defined by

$$S_\varepsilon = \left\{ p_n : \sum_{i=1}^{N} p_n(i) = 1, \; p_n(i) \geq \varepsilon > 0 \right\}$$

can be carried out by means of a change of variables:

$$p_n(i) = \tilde p_n(i) + \varepsilon.$$

The structure of the projection operator algorithm is indicated below.


If $p_n \notin S$, then


1. find $p_n(D_m)$ according to (3.45) for $m = N$;
2. if $p_n(D_m) \notin S$, then find the smallest component of the vector that is orthogonal
to the closest face $\Gamma_{m-1}$ (3.45).


The Matlab mechanization of the projection procedure is given in Appendix B. 
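The steps above can be sketched in Python as follows. This is not the Matlab code of Appendix B but an illustrative reading of the procedure: project onto the hyperplane of the active coordinates, and while the result is not in the simplex, fix the smallest component at zero and repeat in the lower-dimensional face. The function name `project_simplex` and the tolerance are assumptions of this example.

```python
import numpy as np


def project_simplex(p, tol=1e-12):
    """Euclidean projection of p onto {x : sum(x) = 1, x >= 0}.

    At most len(p) passes are needed, one per removed coordinate.
    """
    p = np.asarray(p, dtype=float)
    active = np.ones(p.size, dtype=bool)   # coordinates not yet fixed at zero
    q = p.copy()
    for _ in range(p.size):
        m = active.sum()
        # Projection onto the hyperplane sum(x) = 1 over the active coordinates.
        shift = (p[active].sum() - 1.0) / m
        q = np.where(active, p - shift, 0.0)
        if np.all(q[active] >= -tol):
            return np.maximum(q, 0.0)
        # Otherwise drop the smallest active component (cf. Lemma 6).
        idx = np.flatnonzero(active)
        active[idx[np.argmin(q[idx])]] = False
    return np.maximum(q, 0.0)


b = project_simplex([0.7, 0.9])      # point A of Example 55
c = project_simplex([1.5, -0.2])
print(b, c)
```

For the vector of Example 55 a single hyperplane projection suffices, since subtracting 0.3 from each component already gives a point of the simplex.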


In the next subsection, two detailed applications of learning automata are given. 


3.3.4. Applications of Learning Automata 


In this subsection we will give two examples of the application of learning automata
for solving optimization problems. The first one is a straightforward illustration of
the learning automata paradigm in maintenance optimization. Taking into account
that the domain of learning systems is wide, it is natural to connect several learning
techniques in order to benefit from the advantages of different schemes and,
consequently, develop efficient hybrid optimization techniques. An example of such
a hybrid scheme is presented in the second part of this subsection. Let us start
with an application of learning systems to a maintenance optimization problem.


Maintenance Optimization Based on Learning Automata 

Physical systems are designed and built to perform certain defined functions. They 
become more and more complex. This complexity, and the commercial pressures 
to increase productivity induce the development of new and suitable mainten- 
ance strategies in order to improve the performance of a given system, i.e., to 
increase its availability and to decrease the operating costs of repairable systems. 
In other words, the development and implementation of new maintenance policies 
are now of central importance because they can significantly reduce the occurrence 
of breakdowns and improve the performance of the system. 


A maintenance strategy can be defined as a decision rule which establishes a 
sequence of actions to be undertaken with regard to the operation state of the 
considered system. Each state reflects the degree of deterioration of the system. 
Many maintenance strategies have been proposed and studied extensively in the lit- 
erature; see, for example, [3, 4, 5, 12, 22, 42, 46, 56]. In the survey by Valdez-Flores 
and Feldman [60], over 120 papers are cited. 


The modeling of a given maintenance strategy can lead to an analytical model that 
is usually difficult to solve. A cost and a duration are associated with each action. 
The performance of each strategy is evaluated in terms of the mean of the total 
cost per time unit over an infinite horizon, or in terms of stationary availability.
Numerical procedures have been commonly used in order to evaluate the validity 
of analytical models or for evaluating the performance of a given maintenance 
strategy. An overview of the methods developed and used since 1977 for solving 
reliability problems is given in Kuo and Prasad [32]. 


In this study we consider an optimization approach based on learning automata, 
in order to deal with the treatment of analytical models describing a given main- 
tenance strategy [25]. In particular, the optimization problem stated by Zhang et al. 
[63] is considered, related to the optimal replacement policy for a deteriorating pro- 
duction system with preventive maintenance, assuming that the system’s operation 
follows a geometric process. In practice, the probability laws of the deterioration
(degradation) and failure processes are usually estimated under ideal laboratory
conditions by the manufacturer of the device. In real conditions, they depend on
the environmental conditions of the process, and/or do not remain constant during
the useful life of a given device. In these conditions, it is important to have
optimization algorithms that use only the realizations of the criterion to be
optimized. In this subsection, optimization algorithms based on learning automata are
suggested for this requirement.


In this application, the continuous S-model modification of the P-model linear
reward-inaction $L_{R-I}$ scheme (Shapiro and Narendra [55]) is considered (Poznyak
and Najim [38]). The modified scheme can be seen as an estimator algorithm, as
sample means of the environment responses for each action are maintained.


Preventive Maintenance. In what follows, we shall be concerned with the problem
related to the optimal replacement policy for a deteriorating production system with
preventive maintenance (PM), following the formulation by Zhang and coauthors
[63]. The optimal replacement is expressed in terms of the accumulated number of
failures that the system has experienced. The provision of preventive maintenance
is incorporated in the system model and in the objective function's cost efficiency.
The basic concept parallels the geometric process replacement policy $N$ introduced
by Lam [35]. In the numerical example presented in Zhang et al. [63], it is assumed
that the distribution function of the life of the system after the $m$th preventive cycle
($X_n^{(m)}$) for each $m$ is exponential.

A cycle is defined as the time interval between the completion of the $(n - 1)$th
repair and the completion of the $n$th repair of the system. In Zhang et al. [63], the
following model for the cost during one renewal cycle is given:


N-I N wn N 
D(N) = | cE to} +o an eecee [pe 
nal 


n=! n=] j=l 


EXP x yom exh] j] PI Unt + Xn” Xon al} (3.46) 
which consists of the sum of failure repair costs, costs of the preventive maintenance
(PM), renewal costs, and the working reward, divided by the system working time.
The system lifetime, $X_n^{(j)}$, at cycle $n$ after the $j$th PM, is an iid random variable
drawn from a cumulative exponential distribution:


an! 
F, (t) = F (a) =] -( = J 


If the system does not fail until time $t$, PM is conducted and the system is put to
work again; if the system fails, it is repaired or replaced. After PM, the system
is as good as at the beginning of the cycle $n$. System repair occurs after $\nu_n$ PMs,
where $\nu_n$ is a random variable. After repair, the system is not as good as new, but
deteriorates according to the geometric process $F_n$. The system replacement ends
the renewal cycle, and is conducted after $N$ repairs.


The system repair and maintenance times increase after each cycle $n$. The failure
repair time at cycle $n$, $Y_n$, and the duration of the PM, $Z_n^{(j)}$, at the $j$th maintenance in
the $n$th cycle, are iid random variables drawn from geometric stochastic processes
with cumulative distributions $G_n(t) = G(b_1^{n-1} t)$ and $H_n(t) = H(b_2^{n-1} t)$ with means
$E\{Y_1\} = \mu_1$ and $E\{Z_1\} = \mu_2$, respectively. The costs per time unit are given by
constants $c_f$ and $c_p$.


The model (3.46) is deterministic. In order to simulate real-life realizations, two
modifications are considered. In the first modification, $D^2(N)$, realizations were
drawn from a stochastic process obtained from (3.46) by replacing the constants
$c_f$, $c_p$, $c$ and $c_w$ by Gaussian-distributed random variables with means $c_f$, $c_p$, $c$
and $c_w$, and variances $\sigma_f^2$, $\sigma_p^2$, $\sigma^2$ and $\sigma_w^2$, respectively. These represent fluctuations
in the costs (e.g., due to fluctuations in the currency rates, changes in prices of the
product at the market, of petrol, spare parts, etc.). A second modification, $D^3(N)$,
was obtained by removing the expectations in (3.46). The main rationale behind
this model was to extract the stochastic processes underlying the model of
Zhang et al. [63]. The optimization is then based on the realizations drawn from
these processes at instants $k = 1, 2, \ldots$.


Figure 3.6 illustrates the behavior of the realizations from $D$, $D^2$ and $D^3$. The plots
show sample means and quantiles (q = 0.1, 0.2, ..., 0.9) of 3300 realizations taken
from the stochastic process $D^2$ (upper left) and $D^3$ (lower left) as a function of the
accumulated number of failures $N$. The noise-to-signal ratio is large, and the process
$D^3$ is strongly heteroscedastic with a skewed distribution. The sample means of
$D^3(N)$ (dashed line) and $D(N)$ (solid line) are shown as a function of $N$ in the
right plot of Figure 3.6. Notice that $E\{D^2(N)\} = D(N)$ but $E\{D^3(N)\} \neq D(N)$;
in particular, significant differences appear with small values of $N$. With all cost
functions, the minimum is located at $N = 11$.


Figure 3.6 Cost functions. Top left: sample mean and 0.1, 0.2, ..., 0.9 quantiles for
realizations from process $D^2$. Bottom left: same for $D^3$. Right: sample means for
realizations from processes $D^2$ and $D^3$


3 The following constants were used by Zhang et al. [63]: ratios of the geometric processes
$b_1 = 0.85$, $b_2 = 0.95$ (system repair and maintenance times), $a = 1.0005$ (system life),
expectation of the system life after PM $\lambda = 10$, expectation of the failure repair time
$\mu_1 = 40$, expectation of the duration of the PM $\mu_2 = 100$, reward of the system per unit
time $c_w = 200$, failure repair cost per unit time $c_f = 5$, PM cost per unit time $c_p = 1.2$,
replacement cost $c = 5000$, and fixed time interval between two consecutive PMs $t = 720$.


Optimization Problem. The main problem treated in this subsection may now be
stated. The optimization problem is to find the optimal replacement policy $N^*$:

$$N^* = \arg\min_N D(N). \tag{3.47}$$

This is the optimization problem considered in [63], where the system is optimized
using the (deterministic) $D(N)$. Two stochastic optimization problems can be
formulated using realizations from the stochastic processes $D^2$ and $D^3$:

$$N^* = \arg\min_N E\{D^i(N)\}, \quad i = 2, 3. \tag{3.48}$$


In practice, this could represent the following optimization in an on-line situation:
First, start the learning process (initialize the learning algorithm). Then select $N_k$
randomly, using a probability distribution (Step 1 in the learning algorithm). Run the
process until it has failed $N_k$ times (measure the realizations of $X_n^{(j)}$, $Z_n^{(j)}$ and $Y_n$)
and evaluate the cost function $D^i(N_k)$ using the measured data (Step 2). Proceed
by updating the probabilities, etc. in the optimization paradigm (Steps 3-4),
set $k := k + 1$ and return to the selection of a new $N_k$. As the number of iterations
increases, the optimization routine should converge to select only the optimal
$N_k = N^*$. In this subsection the learning automata approach is suggested for solving
this optimization task.


Learning Algorithm. The learning algorithm is given by the following:


1. Generate an action $i_k$ (length of the renewal cycle $N_k$) based on the action
probability distribution $p_k$. The technique used by the algorithm to select one action
among $M$ possibilities is based on the generation of a uniformly $U(0, 1)$ distributed
random variable $z$. The algorithm chooses the action $i$ such that $i$ is equal
to the least value of $j$ verifying the constraint $\sum_{l=1}^{j} p_k(l) \geq z$.

2. Apply the action to the environment and obtain the environment response
$\xi_k(D^i(N_k))$.

3. Normalize the environment response to the unit segment. The following
normalization scheme is applied:

$$s_k(i) = \frac{\sum_{t=1}^{k} \xi_t\, \chi(i_t = i)}{\sum_{t=1}^{k} \chi(i_t = i)},$$

where $\chi(i_t = i) = 1$ if $i_t = i$, $\chi(i_t = i) = 0$ otherwise;

$$\hat\xi_k = \frac{s_k(i_k) - \min_i s_k(i)}{\max_i s_k(i) - \min_i s_k(i) + \delta},$$

where $\delta$ is a small positive constant (to avoid division by zero) and min and max
are taken over all $i$ for which $\sum_{t=1}^{k} \chi(i_t = i) > 0$.

4. Update the action probabilities using the Shapiro-Narendra rules:

if the selected action $i_k = i$: $p_{k+1}(i) = p_k(i) + \gamma\,(1 - \hat\xi_k)(1 - p_k(i))$,
for all other actions $j \neq i_k$: $p_{k+1}(j) = p_k(j) - \gamma\,(1 - \hat\xi_k)\, p_k(j)$,

where $\gamma$ is the learning rate.

5. Set $k := k + 1$ and go to step 1.
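Steps 1-5 can be put together in a short simulation loop. The sketch below is illustrative only: it assumes the normalization and update rules read as $\hat\xi_k = (s_k(i_k) - \min_i s_k(i)) / (\max_i s_k(i) - \min_i s_k(i) + \delta)$ and $p_{k+1}(i_k) = p_k(i_k) + \gamma(1 - \hat\xi_k)(1 - p_k(i_k))$, and it replaces the maintenance cost model by a toy environment with three actions and made-up mean costs. The names `select_action`, `run_automaton` and `toy_cost` are hypothetical.

```python
import numpy as np


def select_action(p, z):
    """Step 1: least index j whose cumulative probability reaches z."""
    j = int(np.searchsorted(np.cumsum(p), z))
    return min(j, len(p) - 1)          # guard against round-off in the sum


def run_automaton(cost, n_actions, n_iter, gamma=0.05, delta=1e-6, seed=0):
    """Steps 1-5 with normalized responses and Shapiro-Narendra updates."""
    rng = np.random.default_rng(seed)
    p = np.full(n_actions, 1.0 / n_actions)
    sums = np.zeros(n_actions)
    counts = np.zeros(n_actions)
    for _ in range(n_iter):
        i = select_action(p, rng.uniform())        # step 1
        xi = cost(i, rng)                          # step 2
        sums[i] += xi
        counts[i] += 1
        tried = counts > 0                         # step 3: tried actions only
        s = sums[tried] / counts[tried]
        s_i = sums[i] / counts[i]
        xi_hat = (s_i - s.min()) / (s.max() - s.min() + delta)
        step = gamma * (1.0 - xi_hat)              # step 4
        p_i = p[i]
        p = p - step * p
        p[i] = p_i + step * (1.0 - p_i)
    return p


# Toy environment (assumption): noisy costs around fixed means, minimum at index 1.
MEAN_COSTS = np.array([3.0, 1.0, 2.0])


def toy_cost(i, rng):
    return MEAN_COSTS[i] + 0.1 * rng.normal()


p_final = run_automaton(toy_cost, n_actions=3, n_iter=2000)
print(p_final)
```

The action with the lowest running mean receives a normalized response of zero and therefore the largest reward step, which is what drives the probability mass toward the minimizing action.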


Simulation Experiments — In this subsection an illustrative computer simulation 
example is provided to demonstrate the validity and the efficiency of the suggested 
optimization algorithm. These simulations are indicative of results that can be 
obtained in real situations, and demonstrate the efficacy of learning automata in 
solving optimization problems related to reliability. 


A series of computer simulations were conducted, in which 50 repeated simulations
on all combinations of three constant learning rates ($\gamma \in \{0.10, 0.05, 0.02\}$)
and the three models $D$, $D^2$ and $D^3$ were made. The action set was selected
as $U = \{5, 7, 9, 11, 13, 15\}$, where the optimal action for all three models is
$u^* = U(4) = 11$. For model $D^2$, the standard deviations of the costs were set to
$\tfrac{1}{10} c_f$, $\tfrac{1}{10} c_p$, $\tfrac{1}{10} c$ and $\tfrac{1}{10} c_w$, respectively. For model $D^3$, uniform distributions in
$[0, 2\mu_i]$ ($i = 1, 2$) were assumed for $H$ and $G$.


Table 3.1 Percentage of correct responses

$\gamma$      Model $D$ (%)   Model $D^2$ (%)   Model $D^3$ (%)
0.10   90            86              76
0.05   100           98              88
0.02   100           100             100


Table 3.2 Average number of iterations at convergence

$\gamma$      Model $D$       Model $D^2$       Model $D^3$
0.10   3100          3200            3300
0.05   7200          8200            9000
0.02   21500         22100           24100


The averaged results of the simulations are shown in Tables 3.1 and 3.2. Table 3.1 
shows the average percentage of converging to a correct response, for each learning 
rate experimented (rows) and each model (columns). Table 3.2 gives the average 
number of iterations at convergence. 


Convergence was assumed when any of the action probabilities was greater than
0.99. As expected, a larger learning rate resulted in faster convergence, but at
the cost of a decreased percentage of runs converging to the correct optimal solution.
For the deterministic model, a learning rate $\gamma = 0.05$ gave correct results in all
50 simulations; the result was obtained on average in 7200 iterations. For the
noisy case (model $D^2$) the same learning rate gave 98% correct results; for the
stochastic model $D^3$ the correct solution was found in only 88% of the simulations.
The learning rate $\gamma = 0.02$ provided 100% correct results with all models. With
a sufficiently small learning rate the convergence to an optimum is guaranteed,
but the number of iterations may become infeasible. In general we can conclude,
however, that the LA-based optimization was relatively insensitive to the significant
noise present in models $D^2$ and $D^3$.


Although the large number of iterations required may seem infeasible for practical
applications, it is worth noticing that the automata-based search is efficient in that
most of the search is concentrated on the "potentially optimal" solutions. A typical
simulation is shown in Figure 3.7. The top plot shows that the action corresponding
to $N = 5$ was selected only 37 times during the 4173 iterations; the last selection
occurred at $k = 600$. After 1500 iterations, only two actions were selected,
corresponding to $N = 11$ and $N = 13$. The probability of selecting the optimal action
is shown in the middle plot. After 765 iterations, at least every second selection
corresponds to the optimal action. From a practical point of view, the loss function
is of great interest (bottom plot). It shows the average cost from the beginning
of the iterations. Notice that a random selection with equal probabilities would
result in a constant loss function value of $-102.1$. Clearly, the performance is already
improved after the first 100 iterations, and approaches that of the optimal action
($-112.3$). Finally, let us note that a number of advanced reinforcement schemes
exist, as discussed briefly in Subsection 3.3.2, aiming at increased speed of
convergence. Indeed, results indicating significant speed-up have been reported with
the use of discretized algorithms, or hierarchical automata, for example.


Figure 3.7 A typical simulation ($\gamma = 0.05$, model D?). Top: selected actions; middle:
probability for selecting the optimal action; bottom: loss function as a function of iterations


To conclude, a new learning automata-based perspective for solving some
optimization problems related to reliability was introduced. The approach was examined in
numerical simulations of the optimization of the accumulated number of failures,
in the framework of an optimal replacement policy for a deteriorating production
system with preventive maintenance. The simulation results indicate that
the suggested approach is potentially promising for solving some
optimization problems related to reliability and maintenance.


The next subsection is devoted to a hybrid scheme that allows larger (time-invariant)
learning rates to be used with the reinforcement schemes. The optimality of the
solution is safeguarded by statistical considerations, using the concept of
approximated confidence probability. Indeed, this subsection enhances the ideas presented
in its introduction.


Learning Automata and Confidence Probabilities 
It is convenient at this stage to introduce some notations that we use throughout the 
following example. 


Variables and functions 


ACP approximative confidence probability 
counter 

confidence probability 

noise 

function, probability density function 

state of an automaton, $i \in \mathcal{I}$

number of automata states (quantification number)
set of states, $\mathcal{I} = \{1, 2, \ldots, I\}$

index, $j \in \mathcal{I}$

iteration index, $n = 1, 2, \ldots$

sample mean 

sample mean (random variable) 

probability 

probability 

automaton action, environment input, u € U 
compact in RM 

finite discrete set of automaton actions 
variance 

function to be minimized 


«c QC ROS UC HMVmSS805 
"v 


Observe that an automaton is a particular case of Markov chain with a single state. 
In order to make a distinction between the actions and the numbering associated 
with them, we have, however, called this numbering "states." 


Greek letters 


$\gamma$ learning rate
$\xi$ environment response
$\chi$ indicator function


Subscripts and superscripts 


$x^*$ optimal x

x related to confidence probabilities (modification 2) 

x related to normalization enhancement (modification 1) 
$x_n$ value of x at iteration n, $x_n \in \{x_1, x_2, \ldots\}$

$x(i)$ ith element of vector x
$x(i,j)$ (i, j)th element of matrix x

x normalized x, $x \in [0, 1]$
ordered x, $x(i) \leq x(j)$ if $i < j$
estimated x


M) HIR 


Introduction. The theoretical background of stochastic learning automata (SLA)
is solid. They provide a framework for guided random search based on a discretized
search space and probabilities. They are simple to implement, with few and
transparent tuning parameters, and are simple to apply for constrained optimization
problems as well. Given that the search space is compact, and the environment is
stationary with finite variances, the use of SLA schemes is well justified. The practical
viability of SLA-based techniques suffers heavily, however, from the difficulties of
finding a suitable discretization for the search space, as well as from the slow
convergence of the related reinforcement algorithms. In many ways, the discretization
problem is very similar to the partitioning problem encountered in process identification
based on local basis functions (see for instance [24]). Slow convergence is due to small
learning rates (often decaying with time), required by theoretical considerations
[38, 51], which ensure convergence to the optimum.


In this subsection, a hybrid scheme is suggested that allows larger (time-invariant) 
learning rates to be used with the reinforcement schemes. The optimality of the 
solution is safeguarded by statistical considerations, using the concept of approxim- 
ated confidence probability [10]. Numerical examples illustrate that the suggested 
method is able to find the optimal solution with affordable costs. 


Optimization Problem Let us consider a real-valued function $V(u)$ of a vector
parameter $u \in U$, where $U$ is a compact set in $\mathbb{R}^M$. The task is
to find the value $u = u^*$ that minimizes this function, i.e.,

$$ u^* = \arg\min_{u \in U \subset \mathbb{R}^M} V(u). \tag{3.49} $$


Note that there are almost no conditions concerning the function V(u) (continuity, 
unimodality, differentiability, convexity, etc.) to be optimized. A global optimiza- 
tion problem of multimodal and nondifferentiable functions is considered. 


Let $\xi_n$ be the observation of the function $V(u)$ at the point
$u_n \in U$, i.e.,

$$ \xi_n = V(u_n) + e_n, \tag{3.50} $$

where $e_n = e_n(\omega)$ is a random variable given on a probability space
$(\Omega, \mathcal{F}, \Pr)$ ($\omega \in \Omega$, a space of elementary
events), characterizing the observation noise associated with the point $u_n$
at sampling instant $n$. The stochastic optimization problem on the given
compact that we intend to address is: using the observations $\{\xi_n\}$,
construct the sequence $\{u_n\}$ that converges (in some probability sense) to
the optimal point $u^*$. The optimized function is not assumed to be unimodal
or even convex.

Consider now a quantification $\{U(1), U(2), \ldots, U(I)\}$ of the admissible
compact region $U \subset \mathbb{R}^M$:

$$ U(i) \subset U \tag{3.51} $$

$$ U(i) \cap U(j) = \emptyset, \quad i \ne j; \quad i, j = 1, \ldots, I \tag{3.52} $$

$$ \bigcup_{i=1}^{I} U(i) = U \subset \mathbb{R}^M \tag{3.53} $$

and $\mathcal{U} = \{u(1), u(2), \ldots, u(I)\}$, $u(i) \in U(i)$. Here the
points $u(i)$ are some fixed points (e.g., the center points of the
corresponding subsets $U(i)$). The optimization problem can be represented as
finding the optimal state $i^*$:


$$ i^* = \arg\min_{i \in \mathcal{I}} E\{\xi_n \mid u_n = u(i)\} \tag{3.54} $$

using realizations $\{\xi_n\}$ from the stochastic environment. This type of
problem is conveniently solved using learning automata. Figure 3.8 illustrates
the optimization procedure to be considered. If $I$ is large enough, the
solutions of this and the initial problems are close (see [51]).


Normalization The normalization block (see Figure 3.8) consists of two elements: 
the calculation of sample mean and variance, and scaling to unit interval. 


Sample mean and variance The expectation of the environment response
associated with the ith state of the automaton,
$E\{\xi_n \mid u_n = u(i)\}$, can be approximated by the sample mean
$s_n(i)$:

$$ s_n(i) = \frac{\sum_{k=1}^{n} \xi_k \, \chi(i_k = i)}{\sum_{k=1}^{n} \chi(i_k = i)}, $$

where $\chi$ is the indicator function: $\chi(i_k = i) = 1$ if $i_k = i$,
$\chi(i_k = i) = 0$ otherwise. Denote the counters by $c_n(i)$:
$c_n(i) = \sum_{k=1}^{n} \chi(i_k = i)$. The sample mean can then be written
in a recursive form:

$$ s_n(i) = \begin{cases} \left(1 - \dfrac{1}{c_n(i)}\right) s_{n-1}(i) + \dfrac{1}{c_n(i)} \xi_n & \text{if } i_n = i \\ s_{n-1}(i) & \text{otherwise.} \end{cases} $$

Figure 3.8 A learning automaton connected in a feedback loop with the environment
(block labels: environment (cost function); associate action; sample mean and
variance; normalization; reinforcement; generation of next action; automaton)


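The variance of the environment response associated with the ith state of the
automaton, $\mathrm{Var}(\xi_n \mid u_n = u(i))$, can be approximated by the
sample variance:

$$ v_n(i) = \frac{\sum_{k=1}^{n} \left(\xi_k - s_n(i)\right)^2 \chi(i_k = i)}{\left[\sum_{k=1}^{n} \chi(i_k = i)\right] - 1}. \tag{3.55} $$

For large n, a recursive form of (3.55) is given by

$$ v_n(i) = \begin{cases} \left(1 - \dfrac{1}{c_n(i)}\right) v_{n-1}(i) + \dfrac{1}{c_n(i)} \left(\xi_n - s_n(i)\right)^2 & \text{if } i_n = i \\ v_{n-1}(i) & \text{otherwise.} \end{cases} $$

The variance of the sample mean, $\mathrm{Var}(s_n(i))$, can then be
approximated by $\upsilon_n(i)$:

$$ \upsilon_n(i) = \frac{v_n(i)}{c_n(i)}. $$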
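The recursions for the counter, the sample mean and the (large-n) sample
variance can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the class name `StateStatistics` and
its method names are hypothetical, states are indexed from 0 rather than 1,
and an unvisited state is given an infinite variance of the mean, a choice
that the text does not prescribe.

```python
class StateStatistics:
    """Recursive counters, sample means and variances for the automaton states."""

    def __init__(self, n_states):
        self.c = [0] * n_states      # counters c_n(i)
        self.s = [0.0] * n_states    # sample means s_n(i)
        self.v = [0.0] * n_states    # sample variances v_n(i), large-n recursion

    def update(self, i, xi):
        """Fold the response xi observed in state i into the statistics."""
        self.c[i] += 1
        w = 1.0 / self.c[i]
        self.s[i] = (1.0 - w) * self.s[i] + w * xi
        # The variance recursion uses the already updated sample mean.
        self.v[i] = (1.0 - w) * self.v[i] + w * (xi - self.s[i]) ** 2

    def variance_of_mean(self, i):
        """Approximate Var(s_n(i)) by v_n(i) / c_n(i)."""
        if self.c[i] == 0:
            return float("inf")      # assumption: nothing is known yet
        return self.v[i] / self.c[i]
```

Only the statistics of the selected state change at each iteration, so one
update costs a constant amount of work regardless of the number of states.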


Scaling Scaling (normalization) to unit interval, $\bar{\xi}_n \in [0, 1)$, is
ensured by:

$$ \bar{\xi}_n = \frac{s_n(i_n) - \min_{i \in \mathcal{I}} s_n(i)}{\max_{i \in \mathcal{I}} s_n(i) - \min_{i \in \mathcal{I}} s_n(i) + \delta}, \tag{3.56} $$

where $i_n \in \mathcal{I}$ is the state selected at iteration (sample) n. In
order to avoid division by zero, a small positive constant $\delta$ is added to
the denominator. Notice that $\bar{\xi}_n$ represents the sample mean of the
environment response associated with the state that was selected by the
automaton, scaled to unit interval. This scaling ensures that the expected
penalty approaches zero as the number of iterations approaches infinity:
$E\{\bar{\xi}_n \mid i_n = i^*\} \to 0$ as $n \to \infty$.
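As a small illustration of the scaling (3.56), the following Python sketch
maps the sample mean of the selected state to the unit interval. The function
name `normalized_response` and the default value of `delta` are assumptions
made for the example; states are indexed from 0.

```python
def normalized_response(sample_means, selected, delta=1e-6):
    """Scale the sample mean of the selected state to [0, 1), as in (3.56)."""
    lowest = min(sample_means)
    highest = max(sample_means)
    # delta keeps the denominator positive even if all sample means coincide.
    return (sample_means[selected] - lowest) / (highest - lowest + delta)
```

The state with the smallest sample mean always receives the value 0, which is
the property used in the limit statement above.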




Reinforcement Scheme A multitude of reinforcement schemes exist [38, 51]
(linear and nonlinear, reward-inaction, reward-penalty, etc.). In the
Shapiro-Narendra scheme (linear reward-inaction scheme), the probability
updates are given by

$$ p_{n+1}(i) = \begin{cases} p_n(i) + \gamma \left(1 - \bar{\xi}_n\right)\left(1 - p_n(i)\right) & \text{if the selected state } i_n = i \\ p_n(i) - \gamma \left(1 - \bar{\xi}_n\right) p_n(i) & \text{otherwise,} \end{cases} \tag{3.57} $$

where $\bar{\xi}_n$ is the normalized environment response and $\gamma$ is the
"learning rate" parameter of the scheme. Observe that the probability measure
is preserved. It is known that the Shapiro-Narendra scheme is
$\varepsilon$-optimal. In the absence of a priori knowledge, the initial
probabilities can be set equal, $p_0(i) = 1/I$.
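A minimal Python sketch of the update (3.57) may help to see why the
probability measure is preserved: the amount added to the selected state is
exactly the amount removed from the others. The function name
`shapiro_narendra_update` is hypothetical and states are indexed from 0.

```python
def shapiro_narendra_update(p, selected, xi_bar, gamma):
    """One linear reward-inaction step, following (3.57).

    p        -- current state probabilities (they sum to one)
    selected -- index of the state chosen at this iteration
    xi_bar   -- normalized environment response in [0, 1)
    gamma    -- learning rate, 0 < gamma <= 1
    """
    step = gamma * (1.0 - xi_bar)
    return [
        p_i + step * (1.0 - p_i) if i == selected else p_i - step * p_i
        for i, p_i in enumerate(p)
    ]
```

With a normalized response close to 1 (a "penalty") the step is close to zero
and the probabilities are left almost unchanged, which is the inaction part of
the scheme.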


Action Generation The output $u_{n+1} \in \mathcal{U}$ of an automaton is
obtained by selecting randomly one of the states, $i_{n+1} \in \mathcal{I}$,
using the probability distribution $p_{n+1}$, and associating the state with a
corresponding action: $u_{n+1} = u(i_{n+1})$. A practical method for selecting
a state according to the probability distribution is to generate a uniformly
distributed random variable $\zeta \in (0, 1)$. The state $i_{n+1}$ is then
chosen as the least value of $i$ satisfying the following constraint:

$$ \sum_{k=1}^{i} p_{n+1}(k) \ge \zeta. $$

The action is then chosen as $u_{n+1} = u(i_{n+1})$.
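The selection rule can be sketched as follows in Python. The function
`select_state` is a hypothetical helper that takes the uniform variate as an
argument so that its behaviour is deterministic; states are indexed from 0,
and the final fallback is an added safeguard against rounding in the
cumulative sum rather than part of the method described above.

```python
import random


def select_state(p, zeta):
    """Return the least index i whose cumulative probability reaches zeta."""
    cumulative = 0.0
    for i, p_i in enumerate(p):
        cumulative += p_i
        if cumulative >= zeta:
            return i
    return len(p) - 1  # safeguard if rounding leaves the sum slightly below zeta


def draw_state(p, rng=random):
    """Draw a state according to p using a fresh uniform variate."""
    return select_state(p, rng.random())
```

For example, with probabilities (0.2, 0.5, 0.3) a variate of 0.5 falls into
the second state, since the cumulative sum first reaches 0.5 there.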


Maximization of the Confidence Probability In this subsection an alternative
for the reinforcement scheme is described. In this scheme, the action of the
automaton at iteration n + 1, $i_{n+1}$, is selected such that

$$ i_{n+1} = \arg\max_{i \in \mathcal{I}} \Delta ACP_{n+1|n}(i_n^*, i), \tag{3.58} $$

where $\Delta ACP_{n+1|n}(i_n^*, i)$ is an estimate of the change in the
approximated confidence probability given that the ith action is selected at
iteration n + 1. First, the approximate confidence probability (ACP) is
described [10]. It is then straightforward to develop an algorithm that
maximizes the ACP at each iteration.


This subsection is concluded by computational considerations. 


Confidence probability The computation of confidence probability is based on
the Bayesian approach, i.e., considering the estimates $s_n(i)$ as random
variables with a time-varying distribution, denoted by $S_n(i)$. Denote by
$i_n^*$ the action(s) with the smallest sample mean at iteration n:

$$ \mathcal{I}_n^* = \arg\min_{i \in \mathcal{I}} s_n(i), \tag{3.59} $$

where $i_n^* \in \mathcal{I}_n^*$.

The confidence probability $CP_n$ (probability that the expectation of
$S_n(i_n^*)$ is smaller than the expectation of the other states, i.e., the
probability that state $i_n^*$ is the optimal state) at iteration n is given
by:

$$ CP_n = \Pr\big( S_n(i_n^*) < S_n(1) \wedge \cdots \wedge S_n(i_n^*) < S_n(i_n^* - 1) \wedge S_n(i_n^*) < S_n(i_n^* + 1) \wedge \cdots \wedge S_n(i_n^*) < S_n(I) \big), $$

given the history $\{\xi_k\}$, $\{i_k\}$, $k = 1, 2, \ldots, n$. A lower bound
approximation is given by [10]:

$$ ACP_n(i_n^*) = \prod_{i=1,\, i \ne i_n^*}^{I} \Pr\big(S_n(i_n^*) < S_n(i)\big), \tag{3.60} $$

where

$$ \Pr\big(S_n(i_n^*) < S_n(i)\big) = \int_{-\infty}^{\infty} \Pr\big(S_n(i_n^*) < a\big) f_{S_n(i)}(a) \, da, \tag{3.61} $$

where $f_{S_n(i)}(a)$ is the probability density function of the random
variable $S_n(i)$.


On the assumption that the distributions are normal, they are completely
defined by their means and variances. At iteration n, the following can be
assumed:

$$ S_n(i) \sim N\big(s_n(i), \upsilon_n(i)\big) = N\left(s_n(i), \frac{v_n(i)}{c_n(i)}\right). $$

As in ordinal optimization [10], a confidence probability that the top r
states include the optimal state can also be computed:


ACP (i5, r) = [ [Pr (S. (it) < Sn(in)) (3.62) 
i-2 


where in, i, = 1,2,...,1 is the index to the ordered set T (Rios), 
where S,(i) < Sp(j) fori < j. 
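For readers who want to experiment with (3.60), the following Python sketch
evaluates the ACP under the normality assumption stated above. It relies on
one additional assumption that is not made explicit in the text, namely that
the sample means of different states are independent; the difference of two
independent normal variables is again normal, so each pairwise probability
reduces to a standard normal distribution function and no numerical
integration is needed. All function names are hypothetical, states are indexed
from 0, and the handling of zero variances is an added convention.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pairwise_probability(s_best, s_other, ups_best, ups_other):
    """Pr(S(best) < S(other)) for independent normal sample means."""
    spread = math.sqrt(ups_best + ups_other)
    if spread == 0.0:
        # Convention for degenerate distributions (assumption).
        return 1.0 if s_other > s_best else 0.5
    return normal_cdf((s_other - s_best) / spread)


def approximate_confidence_probability(s, upsilon):
    """Return (index of smallest sample mean, ACP as in (3.60)).

    s       -- sample means s_n(i)
    upsilon -- variances of the sample means, v_n(i) / c_n(i)
    """
    best = min(range(len(s)), key=lambda i: s[i])
    acp = 1.0
    for i in range(len(s)):
        if i != best:
            acp *= pairwise_probability(s[best], s[i], upsilon[best], upsilon[i])
    return best, acp
```

For two states with sample means 0 and 1 and variances of the mean equal to
0.5 each, the standardized difference is 1 and the ACP is the standard normal
distribution function evaluated at 1.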


Maximization of the confidence probability At iteration n, we can also
estimate the distribution of the sample mean estimate associated with the
state i, under the condition that the jth state is selected at the next
iteration, n + 1. This estimate, denoted by $\hat{S}_{n+1|n}(i, j)$, can be
obtained by noticing that the variance of the jth sample mean estimator
decreases when a new sample is obtained, while the mean is assumed not to
change:

$$ \hat{S}_{n+1|n}(i, j) \sim \begin{cases} N\left(s_n(i), \dfrac{v_n(i)}{c_n(i) + 1}\right) & \text{if } i = j \\ N\big(s_n(i), \upsilon_n(i)\big) & \text{otherwise.} \end{cases} $$

Remark 43 When the number of samples $c_n(i)$ is small, the normal
distribution should be replaced by the Student t-distribution.

Notice that the optimal state does not change, $i_{n+1|n}^* = i_n^*$, since
the mean is not assumed to change. Estimates $ACP_{n+1|n}(i_n^*, j)$, i.e.,
estimates of the ACP at iteration n + 1 under the condition that the jth state
is selected next, are then given by

$$ ACP_{n+1|n}(i_n^*, j) = \prod_{i=1,\, i \ne i_n^*}^{I} \Pr\big(\hat{S}_{n+1|n}(i_n^*, j) < \hat{S}_{n+1|n}(i, j)\big). \tag{3.63} $$

In words, whereas the $ACP_n(i_n^*)$ in (3.60) gives a measure of the
confidence (approximative lower bound) that the state $i_n^*$ is the optimal
state, the $ACP_{n+1|n}(i_n^*, j)$ in (3.63) gives a measure (estimate) of the
same confidence in the event that the jth action is selected at the next
iteration. The probabilities
$\Pr(\hat{S}_{n+1|n}(i_n^*, j) < \hat{S}_{n+1|n}(i, j))$ are obtained from
(3.61). Finally, the estimated change in confidence probability,
$\Delta ACP_{n+1|n}(i_n^*, j)$, is obtained from

$$ \Delta ACP_{n+1|n}(i_n^*, j) = ACP_{n+1|n}(i_n^*, j) - ACP_n(i_n^*). \tag{3.64} $$

The confidence probability is maximized if the next state, $i_{n+1}$, is
selected from $\mathcal{I}_{n+1|n}^{**}$, where

$$ \mathcal{I}_{n+1|n}^{**} = \arg\max_{j \in \mathcal{I}} \Delta ACP_{n+1|n}(i_n^*, j). $$

In the case that the maximum is not unique, the state $i_{n+1}$ can be
selected randomly from the set of largest elements. Notice that the
maximization of the confidence probability can be seen as an automaton
probability update scheme. At each iteration, the automaton probabilities are
given by

$$ p_{n+1}(i) = \begin{cases} \dfrac{1}{\operatorname{card} \mathcal{I}_{n+1|n}^{**}} & \text{if } i \in \mathcal{I}_{n+1|n}^{**} \\ 0 & \text{otherwise,} \end{cases} \tag{3.65} $$

where $\mathcal{I}_{n+1|n}^{**}$ is the set of states maximizing the increase
in confidence probability, and $\operatorname{card} \mathcal{I}_{n+1|n}^{**}$
is the number of elements in $\mathcal{I}_{n+1|n}^{**}$.
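The one-step look-ahead of (3.63)-(3.65) can be sketched in Python as follows.
As an assumption made only for this illustration, the sample means are treated
as independent normal variables, so that the pairwise probabilities have a
closed form instead of requiring the integral (3.61). The function names are
hypothetical, states are indexed from 0, and every state is assumed to have
been sampled at least once with a positive sample variance.

```python
import math


def _phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def _acp(s, ups, best):
    """Product of pairwise probabilities, as in (3.60) and (3.63)."""
    acp = 1.0
    for i in range(len(s)):
        if i != best:
            acp *= _phi((s[i] - s[best]) / math.sqrt(ups[best] + ups[i]))
    return acp


def confidence_maximizing_states(s, v, c):
    """Return (states maximizing Delta ACP, list of Delta ACP values)."""
    n = len(s)
    best = min(range(n), key=lambda i: s[i])
    ups = [v[i] / c[i] for i in range(n)]
    acp_now = _acp(s, ups, best)
    deltas = []
    for j in range(n):
        ups_next = list(ups)
        ups_next[j] = v[j] / (c[j] + 1)   # one more sample of state j
        deltas.append(_acp(s, ups_next, best) - acp_now)
    top = max(deltas)
    return [j for j in range(n) if deltas[j] == top], deltas


def cp_probabilities(argmax_states, n_states):
    """State probabilities of (3.65): uniform over the maximizing states."""
    share = 1.0 / len(argmax_states)
    return [share if i in argmax_states else 0.0 for i in range(n_states)]
```

In the usage below, the second state has been sampled only twice, so one more
sample of it reduces the uncertainty most and it is the state selected next.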

Computational considerations The computational costs involved in the
computation of $\Delta ACP_{n+1|n}(i_n^*, j)$ are non-negligible, due to the
numerical integration in (3.61). For practical purposes, approximations can be
considered. Chernoff bounds were used in [26, 27]. In what follows, neural
networks are trained to approximate the integral.

Notice first that the integral (3.61) can be represented as a function
$f(x_n^*, x_n)$ of two variables
$x_n^* := \sqrt{\upsilon(i_n^*)} / (s(i) - s(i_n^*))$ and
$x_n := \sqrt{\upsilon(i)} / (s(i) - s(i_n^*))$. This function can be
approximated by a structure which is fast to compute, using basic nonlinear
system identification techniques. The following hold: $s(i_n^*) < s(i)$ so
that $x_n^* > 0$ and $x_n > 0$; $f(x_n^*, x_n) = f(x_n, x_n^*)$ so that only a
function $f(x_1, x_2)$ needs to be approximated, where
$x_1 := \max\{x_n, x_n^*\} \ge x_2 := \min\{x_n, x_n^*\}$. From the algorithm
point of view, $\partial f / \partial x_1 < 0$ and
$\partial f / \partial x_2 < 0$ need to be ensured. The parameters of a
sigmoid neural network (SNN) [24] approximating $f(x_1, x_2)$ are given in
Table 3.3, for $x_1, x_2 \in (10^{-13}, 100]$, $x_2 \le x_1$.

Table 3.3 Parameters of a one-hidden-layer sigmoid neural network

$$ f_{nn}(x_1, x_2) = \sum_{h=1}^{H} \alpha_h \, g_h(x_1, x_2) + \alpha_{H+1}, \qquad g_h(x_1, x_2) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{2} \beta_{h,i} x_i - \beta_{h,3}\right)} $$

h   $\alpha_h$       $\beta_{h,1}$     $\beta_{h,2}$     $\beta_{h,3}$
1   0.180854669919   -5.579554177258   -1.546850007942   -0.143039042717
2   0.195138789632   -2.241436244851   -0.552128565960   -0.047639394753
3   1.186063899032    5.054616270515    2.899461967391   -2.044794159578
4   0.502192049821    -                 -                 -


Hybrid Scheme In the above, a basic learning automaton is described, as well
as means for computing the confidence probabilities. Learning automata possess
a number of theoretically appealing properties: in particular, much can be
said about their convergence properties as $n \to \infty$ [38, 51]. For
example, it is known that the Shapiro-Narendra scheme is optimal (the optimal
action is found with probability 1). However, many of the theoretical
advantages are not realized within a viable finite number of iterations.


The two probability update schemes presented above have different goals.
Whereas the aim of optimal reinforcement schemes (e.g., the Shapiro-Narendra
scheme) is to construct a sequence of actions such that the optimal action is
selected with probability 1, the alternative scheme aims at finding a sequence
of actions that will maximize the confidence probability. In what follows, a
hybrid scheme is developed. The guiding principle is to constrain and enhance
the operation of a learning automaton reinforcement probability update
procedure by the confidence probability information regarding the current
solution. This hybrid scheme is referred to as the LACP scheme.


Two heuristics are considered: enhancement of the resolution in normalization, 
and a hybrid scheme. An algorithm for preserving the probability measure is also 
presented. Notes conclude this subsection. 


Enhancement of the resolution in normalization As the reinforcement learning
proceeds, the automaton will concentrate on the set of actions that provide
small sample costs, and select less frequently the actions which provide large
sample costs. The normalization procedure, (3.56), scales the sample means of
the environment response into the unit interval. This scaling is based on the
max and min of the sample means.


Many reinforcement schemes operating in the S-environment are extensions of
the corresponding P-environment schemes ($\xi_n \in \{0, 1\}$; $\xi_n = 0$ in
the case of "reward," 1 in the case of "penalty"). Therefore, the
reward-penalty aspect of the algorithms is captured when the environment
response is in the unit interval ($\bar{\xi}_n$ is close to 0 in the case of
"reward," close to 1 in the case of "penalty"). Often, as the iterations
proceed, most of the time is spent between a few best actions that all produce
environment responses close to 0.


In order to enhance the resolution of the normalization, the normalization can
be based on the set of actions that contains the optimum set with a given
confidence probability $p^{***}$, i.e., equation (3.56) is modified as

$$ \bar{\xi}_n = \frac{s_n(i_n) - \min_{k \in \mathcal{I}^{***}} s_n(k)}{\max_{k \in \mathcal{I}^{***}} s_n(k) - \min_{k \in \mathcal{I}^{***}} s_n(k) + \delta}, \tag{3.66} $$

where the set of actions $\mathcal{I}^{***}$ is defined as
T" = [iurgia 1 ACP (itr) < p] G.67) 


and $i_n$ is the state of the automaton at iteration n. $ACP(i_n^*, r)$ is the
approximative confidence probability that the top r states include the optimal
state, see equation (3.62). In words, the scaling is conducted only among the
states that contain the optimal state with a given probability $p^{***}$,
following the ideas behind ordinal optimization.

Notice that the automaton state probabilities $p_n(j)$ are not modified; only
the scaling provided by the normalization is altered. $p^{***}$ can be
selected to be large; we must have $0 < p^{***} \le 1$; with $p^{***} = 1$ the
modification has no effect. In the simulations the value $p^{***} = 0.98$ was
used.


Hybrid scheme In order to benefit from the good aspects of the two schemes, a
hybrid scheme can be considered. The probability updates given by the
Shapiro-Narendra scheme, (3.57), and by the maximization of change in
confidence probability, (3.65), can be combined by using the latter scheme for
constructing lower bounds (hard constraints) on the probabilities. It is
desired to ensure that two sets of states are always selected with probability
greater than zero: the one(s) with smallest sample mean(s),
$\mathcal{I}_n^*$, and the one(s) that increase(s) maximally the confidence
probability, $\mathcal{I}_{n+1|n}^{**}$. Denote the indexes to these states by
$\mathcal{I}_{n+1}^{**}$:

$$ \mathcal{I}_{n+1}^{**} = \left\{ \mathcal{I}_n^* \cup \mathcal{I}_{n+1|n}^{**} \right\} \tag{3.68} $$

and pose a constraint on the respective probabilities:

$$ q_{n+1}(i) = \max\left[ p_{n+1}(i), \frac{p_n^{**}}{\operatorname{card} \mathcal{I}_{n+1}^{**}} \right], \quad i \in \mathcal{I}_{n+1}^{**}, \tag{3.69} $$

where $p_{n+1}(i)$ is the probability update from the reinforcement scheme
(3.57). From an engineering point of view, this choice of
$\mathcal{I}_{n+1}^{**}$ gives confidence in the correctness of the final
result in that the statistical significance of the sample mean(s) is assessed.
With $p_n^{**} = 0$, the modification has no effect. Posing $p_n^{**} = 1$,
the search is completely guided by the aim of maximizing the confidence
probability. Here, the following is suggested:

$$ p_n^{**} = 1 - ACP_n(i_n^*)^{0.25}, \tag{3.70} $$

based on experimental results. At the beginning of iterations, the confidence
probability is small and the search is determined by the maximization of
$\Delta ACP_{n+1|n}(i_n^*, i_{n+1})$. As the iterations proceed, and the
confidence increases (the variance of the sample mean $S(i)$ tends to zero
with rate $c(i)^{-1}$), this bound approaches 0: $p_n^{**} \to 0$ as
$n \to \infty$.


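Preservation of the probability measure In order to preserve the probability
measure, $\sum_{i=1}^{I} q_{n+1}(i) = 1$ must hold. Let us use the following
procedure.

Denote the states for which the probabilities must be increased by
$\mathcal{I}_{n+1}^{**,p<p^{**}}$ (i.e., the actions which belong to
$\mathcal{I}_{n+1}^{**}$ and have $p_{n+1}(j)$ below the lower bound
$p_n^{**} / \operatorname{card} \mathcal{I}_{n+1}^{**}$ of (3.69)). Denote the
remaining states in $\mathcal{I}_{n+1}^{**}$ by
$\mathcal{I}_{n+1}^{**,p \ge p^{**}}$ (states with probabilities $p_{n+1}(j)$
at or above the lower bound), and all other states by
$\mathcal{I}_{n+1}^{-}$, where
$\mathcal{I}_{n+1}^{**} = \mathcal{I}_{n+1}^{**,p<p^{**}} \cup \mathcal{I}_{n+1}^{**,p \ge p^{**}}$
and $\mathcal{I} = \mathcal{I}_{n+1}^{**} \cup \mathcal{I}_{n+1}^{-}$.

Due to the lower bound (see (3.69)) the probabilities are altered by

$$ q_{n+1}(i) = \frac{p_n^{**}}{\operatorname{card} \mathcal{I}_{n+1}^{**}} $$

for all $i \in \mathcal{I}_{n+1}^{**,p<p^{**}}$. Denote the sum of increases
by Q:

$$ Q = \sum_{j \in \mathcal{I}_{n+1}^{**,p<p^{**}}} \left( \frac{p_n^{**}}{\operatorname{card} \mathcal{I}_{n+1}^{**}} - p_{n+1}(j) \right) $$

and the sum of modifiable probabilities by R:

$$ R = \sum_{i \in \mathcal{I}_{n+1}^{**,p \ge p^{**}} \cup \mathcal{I}_{n+1}^{-}} r(i), $$

where

$$ r(i) = \begin{cases} p_{n+1}(i) - \dfrac{p_n^{**}}{\operatorname{card} \mathcal{I}_{n+1}^{**}} & \text{if } i \in \mathcal{I}_{n+1}^{**,p \ge p^{**}} \\ p_{n+1}(i) & \text{if } i \in \mathcal{I}_{n+1}^{-} \end{cases} $$

and decrease the probabilities by

$$ q_{n+1}(i) = p_{n+1}(i) - \frac{Q}{R} \, r(i) $$

for all $i \in \{\mathcal{I}_{n+1}^{**,p \ge p^{**}} \cup \mathcal{I}_{n+1}^{-}\}$.

The $p_{n+1}(i)$'s are then replaced by the $q_{n+1}(i)$'s for all
$i \in \mathcal{I}$. It is simple to show that the algorithm ensures the
constraints
$q_{n+1}(i) \ge p_n^{**} / \operatorname{card} \mathcal{I}_{n+1}^{**}$ for
$i \in \mathcal{I}_{n+1}^{**}$, and preserves the probability measure:
$q_{n+1}(i) \ge 0$ for all $i \in \mathcal{I}$ and
$\sum_{i \in \mathcal{I}} q_{n+1}(i) = 1$.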
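The redistribution procedure can be sketched in Python as below. This is a
minimal sketch under one interpretive assumption: a protected state counts as
"too small" when its probability is below the lower bound
p**/card(I**) of (3.69). The function name `apply_lower_bound` and its
argument names are hypothetical; states are indexed from 0 and the input
probabilities are assumed to sum to one.

```python
def apply_lower_bound(p, protected, p_bound_total):
    """Raise protected states to the lower bound and rescale the others.

    p             -- probabilities p_{n+1}(i) from the reinforcement scheme
    protected     -- collection of state indices forming the protected set
    p_bound_total -- the value p**_n in [0, 1]
    """
    protected = set(protected)
    bound = p_bound_total / len(protected)
    low = {i for i in protected if p[i] < bound}      # must be increased
    # Modifiable part r(i) of every state that is not increased.
    r = {
        i: (p[i] - bound if i in protected else p[i])
        for i in range(len(p))
        if i not in low
    }
    q_sum = sum(bound - p[i] for i in low)            # Q, sum of increases
    r_sum = sum(r.values())                           # R, modifiable mass
    q = list(p)
    for i in low:
        q[i] = bound
    if q_sum > 0.0:
        # R - Q = 1 - p**_n >= 0, so R > 0 whenever Q > 0.
        for i, r_i in r.items():
            q[i] = p[i] - q_sum * r_i / r_sum
    return q
```

Because R - Q equals 1 - p**, the ratio Q/R never exceeds one, which is why no
probability can fall below its own floor (the bound for protected states, zero
for the others).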


Notes The above modifications are convenient for a user, but introduce a new
parameter $p^{***}$ to the scheme, in addition to the "learning rate"
$\gamma$ (Shapiro-Narendra scheme) and the discretization problem (problem
formulation). However, the algorithm is relatively insensitive to the choice
of $p^{***}$; with small values of $p^{***}$ the reinforcement scheme
operating in an S-environment approaches that of a P-environment. The choices
(3.68) and (3.70) are heuristic and justified by common-sense reasoning.
Whereas the inclusion of the state with smallest cost so far is intuitively
clear, the choice for (3.70) was guided by the desire to avoid premature
convergence ($p_n^{**}$ is close to 1 at early iterations) and to have the
effect of the constraint vanish ($p_n^{**} \to 0$ as $n \to \infty$). However,
these choices are intuitive and simple to alter; the choices suggested have
provided adequate performance in all our experiments.


The computational costs of the modifications are not negligible, even with the
approximation given in Table 3.3. However, taking into account that the
algorithm is to be used on-line with the evaluations from a real process, the
efficiency and reliability of the algorithm are often given priority over
computational costs.


Numerical Experiments In order to illustrate the feasibility and the performance 
of the algorithm presented above, let us consider the optimization problem (3.49) 
where V(u) (not available for measuring) was given by 


V(u) EEE s aa 
u) = — - aes 
2.2 4 


$u \in [1.5, 19.5]$.


Realizations of $V(u_n)$, $\xi_n$, were available according to (3.50), where
$u_n \in \mathcal{U}$ and $e_n$ has zero mean and finite variance $\sigma^2$.
The task was to find the $u(i) \in \mathcal{U}$ that minimizes the function,
using the realizations $\xi_n$.

In the numerical simulations, the discretization was taken as
$\mathcal{U} = \{u_{\min}, u_{\min} + \Delta, u_{\min} + 2\Delta, \ldots, u_{\max}\}$,
$u_{\min} = 1.5$, $u_{\max} = 19.5$, where the resolution of the quantization
is equal to

$$ \Delta = \frac{u_{\max} - u_{\min}}{2^B - 1}. \tag{3.71} $$

Values $B \in \{4, 5, 6\}$ were examined, which result in automata with
$I \in \{16, 32, 64\}$ actions, respectively.

The noise $e_n$ was drawn from a Gaussian distribution with zero mean and
variance $\sigma^2 \in \{0.1, 1\}$. An example of the function as well as
sample realizations are shown in Figure 3.9 for $B = 4$, $\sigma^2 = 1$.
Notice that the function realizations were severely corrupted by noise.
Evaluating the noiseless function, the minimum was found at $i^* = 8$ for
$B = 4$.
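The discretization can be generated with a few lines of Python. The sketch
assumes that the grid contains both end points and has I = 2^B actions, so
that the resolution in (3.71) is the interval length divided by 2^B - 1; the
function name `action_grid` is hypothetical.

```python
def action_grid(b, u_min=1.5, u_max=19.5):
    """Return (list of actions u(1), ..., u(I), resolution) for I = 2**b."""
    n_actions = 2 ** b
    step = (u_max - u_min) / (n_actions - 1)
    return [u_min + k * step for k in range(n_actions)], step
```

Under this assumption, B = 4 gives 16 actions spaced 1.2 apart, and the state
reported as optimal, i* = 8 (counting from 1), would correspond to the action
1.5 + 7 x 1.2 = 9.9.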


In all experiments, the following pseudo-algorithm was used:

1. Initialize recursions.
2. Select a state $i_n$ based on the probability distribution $p_n$, and
   generate an action $u_n = u(i_n)$.
3. Observe the environment response $\xi_n$.
4. Compute a normalized environment response $\bar{\xi}_n$.
5. Compute the next probability distribution $p_{n+1}$.
6. Stop, or set $n := n + 1$ and return to step 2.
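The six steps above can be put together for the basic LA scheme in a short
Python sketch. Several details are assumptions made for the sake of a runnable
illustration and are not taken from the text: the recursions are initialized
with one observation per state, the cost function passed in the usage example
is a made-up quadratic, and the function name `run_learning_automaton` is
hypothetical. States are indexed from 0.

```python
import random


def run_learning_automaton(cost, actions, gamma, n_iter, noise_std,
                           delta=1e-6, seed=0):
    """Basic LA scheme: steps 1-6 with (3.50), (3.56) and (3.57)."""
    rng = random.Random(seed)
    n = len(actions)
    # Step 1: initialize (assumption: one noisy observation per state).
    p = [1.0 / n] * n
    c = [1] * n
    s = [cost(u) + rng.gauss(0.0, noise_std) for u in actions]
    for _ in range(n_iter):
        # Step 2: select a state according to p and generate the action.
        i = rng.choices(range(n), weights=p)[0]
        u = actions[i]
        # Step 3: observe the noisy environment response (3.50).
        xi = cost(u) + rng.gauss(0.0, noise_std)
        c[i] += 1
        s[i] += (xi - s[i]) / c[i]          # recursive sample mean
        # Step 4: normalized response (3.56).
        lowest, highest = min(s), max(s)
        xi_bar = (s[i] - lowest) / (highest - lowest + delta)
        # Step 5: Shapiro-Narendra update (3.57).
        step = gamma * (1.0 - xi_bar)
        p = [p_k + step * (1.0 - p_k) if k == i else p_k - step * p_k
             for k, p_k in enumerate(p)]
        # Step 6: continue with the next iteration.
    return p, s, c
```

A call such as `run_learning_automaton(lambda u: (u - 7.0) ** 2, actions,
0.25 / len(actions), 500, 1.0)` returns the final state probabilities, sample
means and counters; the state with the largest probability is then taken as
the solution.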


Three different parameter update schemes were experimented with:

• LA scheme using the Shapiro-Narendra scheme (3.57) for probability updates,
  $p_n^{**} = 0$ in (3.69) and $\gamma = 0.25/I$.
• CP scheme, which attempts to maximize the increase in confidence
  probability, equation (3.65), $p_n^{**} = 1$ in equation (3.69),
  $\mathcal{I}_{n+1}^{**} = \mathcal{I}_{n+1|n}^{**}$.
• LACP scheme, which is the hybrid scheme using (3.57) with (3.66), (3.65)
  with (3.68), and (3.70): $p_n^{**} = 1 - ACP_n(i_n^*)^{0.25}$;
  $p^{***} = 0.98$ and $\gamma = 0.25/I$.

Figure 3.9 Noiseless function (solid line) and five sample realizations (circles) for each
discretized input $u(i) \in \mathcal{U}$ for $I = 16$ ($B = 4$). Horizontal
axis: u, from 1.50 to 19.50


In a set of experiments, the parameter B ranged from 4 to 6, which resulted in
automata with $I \in \{16, 32, 64\}$ states and actions. Each experiment was
repeated 45 times; the maximum number of iterations was set to $5000I$.


Figures 3.10-3.12 illustrate the average results from these simulations for
different discretizations, $I \in \{16, 32, 64\}$. Each figure shows the
evolution of the loss function $\Phi_n$:

$$ \Phi_n = \frac{1}{n} \sum_{k=1}^{n} \xi_k \tag{3.72} $$

and the ACP, equation (3.60), as a function of iterations (logarithmic scale).
The approximative confidence probability (ACP) gives a direct evaluation of
the "goodness" of the solution, which has an exact interpretation in the case
of Gaussian noise. The average loss ($\Phi_n$) gives a measure of the costs of
the search. This measure has great significance when optimizing an industrial
process during its operation. Notice that on-line search is constrained by the
fact that the process must remain profitable and that the search for the
optimum does not come without a cost.

Figure 3.10 Averaged simulation results for automata with 16 states (B = 4). The upper
plot shows the loss, the lower plot depicts the ACP, averaged over 45 simulation runs.
Notation: LA scheme (squares), CP scheme (diamonds), LACP scheme (circles); small
noise level (dashed lines), high noise level (solid lines)

Figure 3.11 Averaged simulation results for automata with 32 states (B = 5). See caption
to Figure 3.10 for notation


Results for all three schemes are shown: the "basic" LA-optimization scheme
using the Bush-Mosteller reinforcement (marked by squares), the scheme
maximizing the increase in confidence probability (marked by diamonds), and
the hybrid scheme (marked by circles). The dashed lines refer to the case with
less noise ($\sigma^2 = 0.1$), the solid lines to the high-noise case
($\sigma^2 = 1.0$).


With all three configurations ($I \in \{16, 32, 64\}$), the convergence of the
LA scheme (squares) was notably invariant to noise. The average losses (upper
plots) decreased slightly more slowly in the case of high noise; similarly,
the average ACP (lower plots) remained at a lower level in the high-noise
case.


The CP scheme seems to work very well in the low-noise case (dashed lines).
The loss decreased rapidly with all three configurations; also the ACP
increased quickly to 1. Unfortunately, it also seemed that the performance
deteriorated with increased noise level (solid lines). Even if the loss
decreased quickly during the first few hundred iterations, the largest losses
(after a few thousand iterations) were observed with this scheme. As could be
expected, the highest ACPs were observed with this scheme with both low and
high noise levels.

Figure 3.12 Averaged simulation results for automata with 64 states (B = 6). See caption
to Figure 3.10 for notation


The performance of the hybrid LACP scheme provided a mixture of the two other
schemes. Let us have a closer look at the case $I = 32$ (see Figure 3.11);
similar conclusions can be drawn from the other simulations, too. The average
losses show that the LACP scheme was able to capture much of the
cost-efficiency of the CP scheme during the first few iterations. During the
final iterations the performance followed that of the LA scheme, with the
smallest average losses. When looking at the confidence of the result, the ACP
increased almost as fast as with the CP scheme. In particular, with the LA
scheme the increase of ACP stopped at around 1000 iterations; with the LACP
scheme, the ACP kept on increasing until the end of the iterations. This was
due to the constraint posed by the hybrid scheme. Notice that the selection of
the action that maximized the ACP was relatively rare. On average, the
selection of the automaton output was based on the CP-criterion every 10th
iteration (at $n \approx 1000$) and every 50th iteration
($n \approx 10000$), i.e., in 9 iterations out of 10 (in 49 iterations out of
50) the LACP scheme selected the automaton output and updated the state
probabilities according to the (resolution-enhanced) Bush-Mosteller
reinforcement scheme.

Table 3.4 Simulation statistics for $\sigma^2 = 0.1$

Scheme  States  % Correct  Average $ACP_n(i_n^*)$  Average $\Phi_n$
LA      16      100        0.94                    -1.547
LA      32       90        0.92                    -1.617
LA      64       86        0.87                    -1.632
CP      16      100        1.00                    -1.549
CP      32      100        1.00                    -1.639
CP      64      100        1.00                    -1.642
LACP    16      100        0.99                    -1.546
LACP    32      100        0.99                    -1.627
LACP    64      100        0.99                    -1.639

Table 3.5 Simulation statistics for $\sigma^2 = 1.0$

Scheme  States  % Correct  Average $ACP_n(i_n^*)$  Average $\Phi_n$
LA      16       94        0.86                    -1.529
LA      32       88        0.81                    -1.615
LA      64       64        0.58                    -1.617
CP      16      100        1.00                    -1.505
CP      32       98        1.00                    -1.591
CP      64      100        1.00                    -1.594
LACP    16      100        0.97                    -1.537
LACP    32      100        0.97                    -1.612
LACP    64      100        0.97                    -1.606


Tables 3.4 and 3.5 show the percentage of correct solutions, the average ACP
at the end of iterations and the average loss at the end of iterations. For
the LA scheme, the solution (third column in the tables) was given by the
action corresponding to a state with the largest probability. For the CP and
LACP schemes, the solution was given by $i_n^*$. The expense of finding the
optimal solution is indicated by the average losses (right-most column in the
tables).


As is well known, the LA scheme suffers from the increase of the search space.
This is reflected by the results; in the case of noisy data (Table 3.5) and 64
automaton actions ($B = 6$), every third solution obtained using the LA scheme
was found to be incorrect. This is reflected by the ACP of the solutions,
showing that it indeed works as a measure of goodness of the solution.
Throughout the simulations, the results indicate that the CP scheme and the
LACP scheme came up with the correct solution with a high confidence. In
addition, with the CP and LACP schemes, the confidence increases as a function
of the iterations, so that the number of iterations can be increased if the
confidence level is judged insufficient. The average final losses indicate
that the LA scheme is the most economical in the case of noisy data, the CP
scheme in case of low noise levels. As already discussed, the LACP scheme
provided a cost-effective search irrespective of the experimental noise levels.

These simulations illustrate a dilemma in stochastic optimization, the trade-off
between confidence and loss. For any algorithm, it holds that the more the algorithm 
is allowed to explore, the greater are the costs due to deviation from the optimum. 
It seems apparent that the suggested hybrid scheme is able to combine much of the 
loss-efficiency of the Shapiro—Narendra scheme, and at the same time express and 
improve the confidence of the solution. Remember that even if it can be shown 
that the LA-based approaches are optimal, this presumes a suitable selection of 
the learning rate (in practice, the selection is conducted by trial and error) and 
may require a number of iterations that is practically non-viable (cf. time-varying 
learning rates). The high confidence of the hybrid LACP makes the selection of 
the learning rate less sensitive, as the reliability of the solution is assessed and 
the search is guided by measuring and simultaneously optimizing the confidence 
probability. 


Conclusions An on-line algorithm for global process optimization on the basis
of noisy realizations was presented. Two modifications to a basic stochastic
learning automata optimization using the Shapiro-Narendra reinforcement
learning were considered. The first considered the resolution in
normalization; the second constrained the search using statistical arguments.
Approximate confidence probabilities are extensively used by the
modifications, and a computationally affordable approximation was considered.
Numerical experiments showed that the suggested algorithm is cost-efficient
and highly reliable.


In process engineering, as in many other fields, on-line algorithms for solving 
global stochastic optimization problems are of great interest [24]. The main problem 
plaguing the learning automata-based approaches is the lack of feasibility of the 
search, which forces the engineer to make unjustified assumptions about the nature 
of the optimization problem in order to apply gradient-based techniques. It is our 
belief that the merging of learning automata with confidence probabilities can 
provide useful means in order to develop viable algorithms, applicable also in the 
real industrial environment. 


The next sections provide a brief introduction to the techniques of simulated 
annealing and genetic algorithms. 


3.4. Simulated Annealing 


The architecture and the behavior of learning automata are inspired by the
structure of biological systems. In fact, an organism is born with relatively
little initial knowledge and learns appropriate actions through trial and
error. The concepts and the vocabulary are borrowed from biology and
psychology. Simulated annealing (SA) [30, 31], as well as the genetic
algorithms (GA) [20, 21] discussed in the next section, are also random search
optimization techniques inspired by physical considerations and observations.
Howell et al. [23] consider a genetic algorithm in which the selection
procedure is done using a team of learning automata with binary outputs. An
optimization algorithm based on this team is presented and analyzed
(convergence and convergence rate) in the next chapter.


From a theoretical point of view, simulated annealing algorithms are based on the 
work of Boltzmann [28]. We shall briefly present his work, which we consider to be 
the foundation of simulated annealing, and explain clearly its probabilistic nature, 
via statistical thermodynamics. The “philosophical” idea behind SA is very simple. 
To our knowledge, each publication on simulated annealing algorithms starts by 
recalling the fact that they correspond to the manner in which liquids freeze, or 
metals recrystallize, and, consequently, how the corresponding entropy (measure 
of disorder, or lack of information) decreases. At any time, these physical processes 
are in thermodynamic equilibrium. This “premise” had already been discovered in 
the Iron Age, when blacksmiths realized that the slower the cooling, the more 
perfect the crystals that form. 


Consider a gas. If the pressure of the gas is low, the gas can be considered to
behave like a perfect gas (there is no potential energy due to interactions between
molecules). Boltzmann first introduced what is called the fundamental
Boltzmann constant. This is where the entropies defined in the frameworks of
communication theory and thermodynamics coincide [13, 45]. The Boltzmann distribution
is expressed as follows: the probability of finding a given macroscopic or microscopic
system in equilibrium in an energy level $U_s$, with a thermostat that maintains
its temperature $T$ constant, is given by


\[ \Pr(U = U_s) = \frac{1}{Z} \exp(-\beta U_s), \]


with


\[ Z = \sum_{s} \exp(-\beta U_s), \]


where the summation is done over all the states of the system⁴.
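As a small numerical illustration of the Boltzmann distribution above, the following sketch computes the probabilities of a finite set of energy levels. The function name, the energy values and the choice $k_B = 1$ are assumptions made for the example only.

```python
import math

def boltzmann_probabilities(energies, temperature, k_b=1.0):
    """Return Pr(U = U_s) = exp(-beta * U_s) / Z for each energy level U_s."""
    beta = 1.0 / (k_b * temperature)
    # Shifting by the minimum energy avoids overflow; the shift cancels in the ratio.
    u_min = min(energies)
    weights = [math.exp(-beta * (u - u_min)) for u in energies]
    z = sum(weights)  # partition function, up to the common shift factor
    return [w / z for w in weights]

levels = [0.0, 1.0, 2.0]  # hypothetical energy levels
cold = boltzmann_probabilities(levels, temperature=0.1)
hot = boltzmann_probabilities(levels, temperature=100.0)
```

At low temperature almost all the probability mass sits on the lowest energy level, while at high temperature the distribution is nearly uniform. This is the behavior that simulated annealing exploits.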


Methods of simulated annealing [16, 18, 54] have been proposed for the problem 
of finding numerically the global optimum of a function defined on a subset of a 
k-dimensional space. Usually, due to nonlinearities, real problems exhibit several 
local optima. Similarly to optimization techniques based on learning automata, sim- 
ulated annealing represents a global optimization method that distinguishes between 
different local optima and is suitable for optimization of multimodal functions. 


4 In the Maxwell theory of a perfect gas, the probability that a particle belongs to an elementary
volume in the speed space is proportional to an exponential term that is called the Boltzmann
factor. The entropy is given by

\[ S = k_B \frac{\partial (T \ln Z)}{\partial T}, \qquad \beta = \frac{1}{k_B T}. \]


The motivation of the methods lies in the physical process of annealing, in which
a solid is heated to a liquid state and, when cooled sufficiently slowly, takes up the 
configuration with minimal energy. The development of SA established a mathem- 
atical analogy between optimization problems and thermodynamic systems, thus 
creating a new foundation from which to analyze and solve the minimization of 
other functions than the energy. 


Using the Metropolis acceptance criterion, the probability of a transition from
solution $i$ to solution $j$ can be defined using the following expression:


\[ P_{ji}(t_k) = G_{ji} \times
   \begin{cases}
     \exp\left(-\dfrac{\Delta f_{ji}}{t_k}\right), & \Delta f_{ji} > 0, \\
     1, & \Delta f_{ji} \le 0,
   \end{cases} \]


where $G_{ji}$ is the probability of generating the candidate $j$ given the current state $i$;
$j$ is in the neighborhood of $i$; $\Delta f_{ji}$ denotes $f_j - f_i$. In words, this represents the
well-known greedy hill-climbing algorithm (for a minimization problem), with
the exception that:


• uphill moves in the objective function $f$ are allowed, with a probability
  inversely related to the height of the hill, and
• uphill moves become less likely as the temperature $t_k$ decreases.


The Metropolis algorithm models uphill moves with a Boltzmann distribution: the
probability of accepting an uphill move of size $\Delta f$ at temperature $t_k$ is related to
$\exp(-\Delta f / t_k)$.


Following the analogy with physical systems, the system is cooled so that the
minimum energy state (i.e., minimum value of the objective function) will be
obtained. The speed for lowering the temperature is given by the cooling schedule.
Among the many cooling schedules, let us mention the logarithmic schedule


\[ t_k = \frac{\gamma}{\log(c + k)}, \]


where $t_k$ is the temperature at iteration (time index) $k$; $c$ and $\gamma$ are positive constants.


Given a set of possible problem solutions (configuration set), a well-defined neigh- 
borhood structure (move set), a suitable objective function (cost function), and 
an appropriate temperature profile (cooling schedule), the implementation of SA 
is easy and straightforward. As pointed out by Fleischer [18], for many practical 
applications, these implementation issues do not present much difficulty. A pseudo 
code is given in the following. 


Step 1. Initialize:
    Set maximum number of iterations kmax
    and cooling schedule parameters.
    Generate an initial solution v and evaluate f(v).
Step 2. Anneal:
    WHILE k < kmax
        Set iteration k = k+1, calculate temperature t_k.
        Select randomly a solution u from the neighborhood of v.
        Evaluate f(u) and compute Δf_uv = f(u) − f(v).
        IF Δf_uv ≤ 0
            % Downhill move: accept it
            GOTO Step 3
        ELSE
            % Uphill move: accept maybe
            Generate a U(0,1) random variable R.
            IF R < exp(−Δf_uv / t_k) GOTO Step 3 END
        END
    END
Step 3. Accept: Set the solution v := u. Return to Step 2.
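The pseudo code can be turned into a short runnable sketch. The version below is only an illustration under stated assumptions: the function names, the default constants, the integer test problem and the bookkeeping of the best solution visited are not taken from the text; the acceptance rule and the logarithmic schedule are.

```python
import math
import random

def log_temperature(k, gamma=1.0, c=2.0):
    """Logarithmic cooling schedule t_k = gamma / log(c + k), for k >= 1 and c > 0."""
    return gamma / math.log(c + k)

def simulated_annealing(f, v0, neighbor, k_max=5000, gamma=1.0, c=2.0, seed=0):
    """Minimize f starting from v0; neighbor(v, rng) proposes a candidate near v."""
    rng = random.Random(seed)
    v, f_v = v0, f(v0)
    best, f_best = v, f_v  # best solution visited so far (an added convenience)
    for k in range(1, k_max + 1):
        t_k = log_temperature(k, gamma, c)
        u = neighbor(v, rng)
        f_u = f(u)
        delta = f_u - f_v  # delta <= 0 is a downhill move
        # Metropolis criterion: always accept downhill, accept uphill with
        # probability exp(-delta / t_k).
        if delta <= 0 or rng.random() < math.exp(-delta / t_k):
            v, f_v = u, f_u
            if f_v < f_best:
                best, f_best = v, f_v
    return best, f_best

def step_neighbor(v, rng):
    """Hypothetical move set on the integers: one step to the left or right."""
    return v + rng.choice((-1, 1))

best, f_best = simulated_annealing(lambda x: (x - 7) ** 2, 50, step_neighbor)
```

On this unimodal toy problem the search behaves almost like greedy descent. The uphill acceptance only matters for multimodal functions, where `gamma` and `c` control how long the search keeps escaping local minima.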


From a mathematical point of view, the simulated annealing is a nonhomogeneous 
Markovian optimization algorithm. It corresponds to the behavior of a Markov 
chain with transition probabilities depending on a parameter, i.e., the temperature 
[6]. It also represents, in the same sense, a generalization of a Monte Carlo 
method. 


The bulk of the remainder of this chapter is devoted to the introduction of genetic 
algorithms. 


3.5. Genetic Algorithms 


Genetic algorithms (GA) [20, 21] are stochastic search and optimization methods. 
They are a part of a larger class of evolutionary algorithms (EA). The methods 
draw inspiration from biological systems as a computational and motivational 
model; hence, the origins of these methods can be traced back to Mendel's and 
Darwin's metaphors of natural biological evolution. EAs operate with a popula- 
tion of potential solutions, applying the principle of survival of the fittest [11]. 
At each iteration of an EA, a new generation of approximations is created by the 
processes of selection, reproduction and mutation. This leads to the evolution of 
populations of individuals that are better suited to their environment, just as occurs 
in natural adaptation. The differences between various methods of EA (evolution 
strategies, evolutionary programming, genetic programming, genetic algorithms) 
are in the relative emphasis put on the genotype (e.g., binary coded) representation 
over phenotype representation (e.g., real-valued coding), the role of recombination
(crossover) relative to mutation, and the importance of a large population size over
small population sizes [62]. The brief description given in this section follows the 
presentations by Chipperfield [11] and Whitley [62]. 


Let us first give an outline of a genetic algorithm. The steps are then discussed in
what follows.


Step 1. Initialize:
    Select representation.
    Initialize and evaluate the population.
Step 2. Iterate:
    WHILE k < kmax
        Set k = k+1
        Select parents based on fitnesses of the individuals.
        Reproduce offspring:
            Recombine
            Mutate
        Evaluate new solutions:
            Evaluate objective function
            for each new solution in the population
            Calculate fitnesses.
        Update population.
    END
Step 3. Accept and output solution.


The first step in a GA is to select the representation and to create an initial population 
in a random fashion. The most common representation in GA is that of a binary 
string. The main argument for bit encodings is that this representation decom- 
poses the problem into the largest number of smallest possible building blocks. 
The essence of GA is in that it works by processing these blocks; the search process 
takes place at the coding level [19]. Empirical evidence suggests, however, that
Gray codings, where adjacent integers are also Hamming distance 1 neighbors, are
generally superior to standard binary encodings [62]. This can be traced back to the ability
of Gray codes to preserve the connectivity of the original real-valued functions.
Consequently, the number of optima in the Gray-coded space does not exceed the
number of optima in the original real-valued function [62]. Real-valued genes improve
the computational efficiency, and follow the philosophical tenet of acting as directly 
as possible in the phenotype space of the problem. For some specific problems, 
integer or other representations provide a convenient way to express that mapping 
from representation to the problem domain. 
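The Hamming-distance property of Gray codes mentioned above is easy to check numerically. The sketch below uses the standard reflected Gray code; the function names are illustrative choices.

```python
def binary_to_gray(n: int) -> int:
    """Reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def gray_to_binary(g: int) -> int:
    """Inverse mapping: XOR together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Adjacent integers 7 and 8 differ in four bits in plain binary,
# but in a single bit once both are Gray coded.
plain = hamming_distance(7, 8)
coded = hamming_distance(binary_to_gray(7), binary_to_gray(8))
```

With the plain binary coding, a mutation must flip several bits at once to move between some adjacent integers. With the Gray coding, a single bit flip always suffices.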


The objective function $f(x_i)$ gives the evaluation of the performance of the $i$th
individual in a population of $N$ individuals, where $x_i$ is the solution's representation
decoded into the problem domain. A fitness function $F$ transforms the objective
function value into a measure of relative fitness. The proportional fitness assignment
is given by


\[ F(x_i) = \frac{g(f(x_i))}{\sum_{j=1}^{N} g(f(x_j))}, \]


where the function $g$ transforms the objective function evaluation into a positive
number. Not surprisingly, many alternative fitness assignments have been proposed,
including linear scalings and power laws, for example.


The parents are selected with probabilities related to their fitness values. Selection is
the process of determining the number of times a particular individual is chosen for 
reproduction: 1) determination of the probability of reproduction, and 2) selection of 
individuals for reproduction (sampling). Efficient selection methods are of interest, 
in order to decrease computational complexity and load; roulette wheel selection 
and stochastic universal sampling are commonly used. 
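A minimal sketch of proportional fitness assignment followed by roulette wheel selection is given below. The function names and the default choice of $g$ as the identity (suitable for a positive objective to be maximized) are assumptions made for the illustration.

```python
import bisect
import itertools
import random

def proportional_fitness(objective_values, g=lambda f: f):
    """F(x_i) = g(f(x_i)) / sum_j g(f(x_j)); g must return positive numbers."""
    transformed = [g(f) for f in objective_values]
    if any(t <= 0 for t in transformed):
        raise ValueError("g must map every objective value to a positive number")
    total = sum(transformed)
    return [t / total for t in transformed]

def roulette_wheel_select(fitnesses, n_parents, rng):
    """Draw n_parents indices, each with probability proportional to its fitness."""
    cumulative = list(itertools.accumulate(fitnesses))
    picks = []
    for _ in range(n_parents):
        r = rng.random() * cumulative[-1]  # a point on the wheel
        idx = bisect.bisect_right(cumulative, r)  # slot containing that point
        picks.append(min(idx, len(fitnesses) - 1))  # guard against rounding at the edge
    return picks

rng = random.Random(0)
fitnesses = proportional_fitness([1.0, 3.0, 4.0])
parents = roulette_wheel_select(fitnesses, n_parents=1000, rng=rng)
```

For a minimization problem, a decreasing positive transformation such as `g=lambda f: 1.0 / (1.0 + f)` (valid for non-negative objectives) could be used instead; this is one possible choice among many.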


Recombination is central to genetic algorithms. With binary representations, one 
or multipoint crossovers are used. First, a crossover point (or several points) is 
chosen randomly. Then bits between successive points are exchanged between 
the two parents to produce two new offspring. The disruptive nature of crossover 
encourages the exploration of the search space. With real-valued encodings, new 
phenotypes are produced around and between the values of parents’ phenotypes. 
The mutation operator flips the value of a single bit, with a given (low) probability. 
With nonbinary representations, the gene value is randomly perturbed, or selected 
from a set of allowed values. Again, a large variety of different recombination 
schemes is available in the literature. 
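For binary representations, one-point crossover and mutation can be sketched as follows. The function names are illustrative, and the mutation shown flips each bit independently with probability `p_mut`, which is one common variant and is assumed here.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Exchange the tails of two equal-length bit lists at a random cut point."""
    point = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

def bit_flip_mutation(bits, p_mut, rng):
    """Flip each bit independently with (low) probability p_mut."""
    return [1 - b if rng.random() < p_mut else b for b in bits]

rng = random.Random(0)
a, b = [0] * 8, [1] * 8
child_1, child_2 = one_point_crossover(a, b, rng)
mutant = bit_flip_mutation(child_1, p_mut=0.05, rng=rng)
```

At every position the two children carry exactly the bits of the two parents, so crossover only rearranges existing material. New bit values enter the population through mutation alone.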


Once a new population has been produced, the fitness of the new individuals may
be determined. Let $N$ denote the size of the old population, and $\lambda$ the number of
offspring. In $(N, \lambda)$ strategies the $\lambda$ offspring replace the parents; $(N + \lambda)$ strategies
pick both from the offspring and the old population in order to create the new
generation. A canonical genetic algorithm is of $(N, N)$ type. A simple $(1 + 1)$
strategy can be viewed as a greedy hill-climbing algorithm, making some random
change and accepting only the improving moves. If $\lambda > N$, some form of selection
needs to be used to prune back the population to only $N$ parents; if $\lambda < N$, a scheme
is required that reinserts the new individuals into the old population. An apparent
strategy is to replace the least fit members (one form of an elitist strategy), but
alternative strategies may turn out to be more successful.


Holland attempted to explain the operation of a genetic algorithm as a hyperplane 
sampler. His schema theorem provides a lower bound on a change in the sampling 
rate for a single hyperplane. To put it very briefly [62], by increasing or decreasing 
the schemata representing competing hyperplanes in the population, according to 
the relative fitness of strings in hyperplane partitions, more trials are allocated 
to regions of the search space that have been shown to contain above-average 
solutions. In problems where there are clear regions that are above average, the 
GA does quickly allocate more trials to such regions. Notice, however, that the 
selection of the representation is crucial here, and that the GA is not guaranteed to 
yield optimal or even near-optimal solutions in general. 


3.6. Conclusions 


This chapter has by no means considered all the approaches for solving uncon- 
strained and constrained optimization problems. The approaches considered are 
not fooled by local optima and the assumptions related to their implementation are
not restrictive. The availability of asymptotic analysis of optimization algorithms
provides a security for the user. These facts make these approaches useful for solving 
practical optimization problems in the framework of engineering problems.


Chapter 4


Analysis of Recursive Algorithms 


4.1. Introduction 


This chapter is dedicated to the asymptotic analysis (convergence and convergence 
rate) of recursive stochastic algorithms, which are used to solve problems arising in
many areas (engineering, economics, biology, ecology, medicine, etc.) [1, 2, 5, 7, 9,
10, 15, 42, 44, 50, 63, 64, 67, 81]. For example, there is now a huge volume of studies
dealing with the impact of maintenance activities on production, environmental 
protection and the reduction of accidents (consider, for example, the explosion of 
the chemical plant AZF on September 21, 2001). Indeed, many reliability problems 
lead to an optimization problem. There is no ready made machinery that we can 
put our algorithm into and that would produce for us the asymptotic properties by 
turning the crank. Nevertheless, we shall try to derive some ideas of how to state 
the convergence of a given algorithm, and to estimate its convergence rate. We 
will derive the lines of a global methodology in order to achieve this analysis. This 
methodology will be presented in the form of “if-then-else.” 


This chapter is organized as follows. The next section deals with the methodology 
cited above. In Section 4.3, we will present several applications of the standard 
inequalities (Cauchy, Jensen, Minkowski, Hadamard, inequalities based on vectors, 
matrices and determinants, etc.), well-known lemmas (Borel-Cantelli, Kronecker, 
Toeplitz, etc.) and theorems (Robbins-Monro, Robbins-Siegmund, etc.). These 
applications will be extracted from the proofs of different results from the liter- 
ature. We will treat in detail two cases in Sections 4.4 and 4.5. The first case 
corresponds to the analysis of a recursive stochastic optimization algorithm on the 
basis of stochastic approximation techniques. It is based on a study carried out by 
Najim et al. [76], and corresponds to the extension of the McMurtry and Fu [61] 
reinforcement scheme. In the second case we consider another recursive stochastic 
optimization algorithm based on a team of learning stochastic automata with bin- 
ary outputs, and which has been developed by Najim et al. [77]. The asymptotic 
properties of this optimization scheme are carried out on the basis of the Lyapunov 
approach and martingale theory. The behavior of the algorithms presented in these 
two sections will be illustrated with some simulation results. Finally, the last sec- 
tion presents the analysis of a simple recursive scheme and a method for deriving 
the convergence rate. This method can be applied to more complex algorithms.


4.2. The Analysis of Recursive Algorithms


In this section, we shall present a methodology for the statement of the asymptotic
properties of a given recursive stochastic algorithm. 


4.2.1. Vector Form 


When dealing with the analysis of a stochastic algorithm, the first thing to do is to
write the algorithm concerned in a vector form. To achieve this objective, we can
use the indicator function


\[ \chi(u = u(i)) = \begin{cases} 1 & \text{if } u = u(i) \\ 0 & \text{if } u \neq u(i); \end{cases} \]


and the following vectors:


\[ e = (1, \ldots, 1)^T, \qquad e(u_n) = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T, \]


where the vector $e(u_n)$ is such that its $i$th component is equal to one if $u_n = u(i)$
and the other components are equal to zero.


In what follows we shall present an example. 


Example 56 (McMurtry and Fu algorithm) (see [61]) In the framework of
learning automata, let us consider the following averaging technique:


\[ z_{n+1}(i) = \alpha \tilde{z}_{n+1}(i) + (1 - \alpha) z_n(i), \qquad 0 < \alpha < 1, \tag{4.1} \]
\[ z_{n+1}(j) = z_n(j), \qquad j \neq i, \tag{4.2} \]


where $\tilde{z}_{n+1}(i)$ represents the realization of the following cost function:


H \’ 
Zn41 (8) = (<=) , y-cte H=cte (4.3) 


and $x_n = x(i)$ corresponds to the action selected by the automaton at time $n$. The
probabilities associated with the learning automaton are assigned as follows:


\[ p_{n+1}(i) = \frac{z_{n+1}(i)}{Z_{n+1}}, \qquad i = 1, \ldots, N, \tag{4.4} \]


where


\[ Z_{n+1} = \sum_{i=1}^{N} z_{n+1}(i). \tag{4.5} \]
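Equations (4.1) and (4.2) can be written as follows:


\[ \begin{pmatrix} z_{n+1}(1) \\ \vdots \\ z_{n+1}(N) \end{pmatrix}
   = \begin{pmatrix} z_n(1) \\ \vdots \\ z_n(N) \end{pmatrix}
   + \alpha \, e(u_n) e^T(u_n)
     \left[ \begin{pmatrix} \tilde{z}_{n+1}(1) \\ \vdots \\ \tilde{z}_{n+1}(N) \end{pmatrix}
          - \begin{pmatrix} z_n(1) \\ \vdots \\ z_n(N) \end{pmatrix} \right]. \]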

The next example concerns the vector form of the Varashavskii-Vorontsova 
reinforcement scheme [68]. 


Example 57 (Varashavskii-Vorontsova scheme) In the Varashavskii-Vorontsova
reinforcement scheme, the probabilities are adjusted as follows:


For $u_n = u(i)$ and $\xi_n = 0$:


\[ \begin{aligned}
   p_{n+1}(i) &= p_n(i) + \gamma_n p_n(i) [1 - p_n(i)] \\
   p_{n+1}(j) &= p_n(j) - \gamma_n p_n(i) p_n(j), \qquad j \neq i, \; j = 1, \ldots, N.
   \end{aligned} \tag{4.6} \]


For $u_n = u(i)$ and $\xi_n = 1$:


\[ \begin{aligned}
   p_{n+1}(i) &= p_n(i) - \gamma_n p_n(i) [1 - p_n(i)] \\
   p_{n+1}(j) &= p_n(j) + \gamma_n p_n(i) p_n(j), \qquad j \neq i, \; j = 1, \ldots, N.
   \end{aligned} \tag{4.7} \]


This algorithm (4.6) and (4.7) can be written in the following vector form:


\[ p_{n+1} = p_n + \gamma_n \, p_n^T e(u_n) \, (1 - 2\xi_n) \, [e(u_n) - p_n], \]

where $\dim p_n = N$, and $p_n$ represents the probability vector at time $n$.


Observe that the previous vector form remains valid for learning automata operating
with continuous environment responses (S-model environment), i.e., $\xi_n \in [0, 1]$.
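The equivalence between the componentwise updates (4.6) and (4.7) and the vector form can be checked numerically. In the sketch below the function names and the numerical values are illustrative assumptions; actions are indexed from zero.

```python
def vv_componentwise(p, i, xi, gamma):
    """Update (4.6) when xi == 0 and (4.7) when xi == 1, component by component."""
    sign = 1.0 if xi == 0 else -1.0
    p_i = p[i]
    return [
        p_j + sign * gamma * p_i * (1.0 - p_i) if j == i
        else p_j - sign * gamma * p_i * p_j
        for j, p_j in enumerate(p)
    ]

def vv_vector_form(p, i, xi, gamma):
    """p_{n+1} = p_n + gamma * (p_n^T e(u_n)) * (1 - 2 xi) * (e(u_n) - p_n)."""
    e_u = [1.0 if j == i else 0.0 for j in range(len(p))]  # e(u_n) for u_n = u(i)
    scale = gamma * sum(p_j * e_j for p_j, e_j in zip(p, e_u)) * (1.0 - 2.0 * xi)
    return [p_j + scale * (e_j - p_j) for p_j, e_j in zip(p, e_u)]

p_n = [0.2, 0.5, 0.3]
reward = vv_vector_form(p_n, i=1, xi=0, gamma=0.1)
penalty = vv_vector_form(p_n, i=1, xi=1, gamma=0.1)
partial = vv_vector_form(p_n, i=1, xi=0.25, gamma=0.1)  # S-model response in [0, 1]
```

Because the components of $e(u_n) - p_n$ sum to zero, the updated vector still sums to one, whatever the value of $\xi_n$.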


In Section 4.4, we will derive another vector formulation of the McMurtry and Fu
[76] optimization algorithm. Recall that the vector form of the Bush-Mosteller [17]
reinforcement scheme was given in subsection 3.3.2 and will be used in Section
4.5 for the development of an optimization algorithm based on a team of learning
automata with binary outputs. 


The recursive form for the binomial law is presented in Example 58.


Example 58 (Recursive binomial law) The binomial law for $x$ and $x + 1$ is given
by

\[ \Pr(X = x) = \frac{n!}{x! \, (n - x)!} \, p^x q^{n-x}, \qquad q = 1 - p, \]

\[ \Pr(X = x + 1) = \frac{n!}{(x+1)! \, (n - (x+1))!} \, p^{x+1} q^{n-(x+1)}. \]


Noticing that $(x + 1)! = (x + 1) x!$, $p^{x+1} = p \, p^x$, $q^{n-(x+1)} = q^{n-x}/q$ and
$(n - x - 1)! = (n - x)!/(n - x)$, and reorganizing, we can rewrite the latter
equation as


\[ \Pr(X = x + 1) = \frac{n - x}{x + 1} \, \frac{p}{q} \, \frac{n!}{x! \, (n - x)!} \, p^x q^{n-x}. \]

It follows that

\[ \Pr(X = x + 1) = \frac{n - x}{x + 1} \, \frac{p}{q} \, \Pr(X = x). \]

The matrix algebra, such as the inversion lemma¹, which is used for the derivation
of the recursive form of the least squares method, represents a useful tool in deriving
recursive algorithms. In [91], in the proof of Theorem 6 of Chapter 4, pp. 132-133,
the inversion lemma is used.


A remark is in order here. 


Remark 44 The reader must also keep in mind the tools related to state-space 
representation, and take advantage of them in order to carry out the vector form of 
the algorithm considered. 


The next subsection deals with the Lyapunov approach. 


4.2.2. Lyapunov Approach 


Lyapunov functions have been used in various contexts (stability, convergence ana- 
lysis, design of model reference adaptive systems, etc.). The Lyapunov approach 
is based on the physical idea that the energy of an isolated system decreases. 
A Lyapunov function maps scalar or vector variables to real numbers ($\mathbb{R}^N \to \mathbb{R}$)
and decreases with time. The main attribute of the Lyapunov approach that makes 
it appealing for solving all the aforesaid engineering problems is that it is simple. 
The main obstacle to the use of Lyapunov theory is in finding a suitable Lyapunov 
function. In the next developments, we assume that for the problem considered a 


1 The inversion lemma plays an important role in the development of recursive algorithms
in many areas such as estimation, prediction, adaptive control theory, etc.


Lyapunov function is available. In [68], Lyapunov functions associated with the
commonly used reinforcement schemes are given. 


Akashi and Moustafa [3] have presented an on-line structure for the parameter
estimation of a class of stochastic systems. The convergence of this structure is
stated using a Lyapunov candidate of the form


\[ W(k) = \tfrac{1}{2} \|\varepsilon(k)\|^2, \]

where $\varepsilon(k)$ is a random vector (parameter estimation error, etc.) and $\|\cdot\|$ denotes
the Euclidean norm. This is a commonly used function in estimation problems and,
for example, in the design of model reference adaptive systems.


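In the analysis of the Robbins-Monro algorithm:


\[ \theta_{n+1} = \theta_n + \gamma_n Y_{n+1}, \tag{4.8} \]
\[ \sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \sum_{n=1}^{\infty} \gamma_n^2 < \infty, \tag{4.9} \]


under the assumption that the conditional expectation of $Y_{n+1}$ is a regression
function of $\theta_n$:


\[ E\{Y_{n+1} \mid \mathcal{F}_n\} = -\varphi(\theta_n) \]

and that

\[ (\theta - \theta^*)^T \varphi(\theta) > 0 \quad \text{for } \theta \neq \theta^*, \]

\[ E\{\|Y_{n+1} + \varphi(\theta_n)\|^2 \mid \mathcal{F}_n\} = \sigma^2(\theta_n), \]
\[ \sigma^2(\theta) + \|\varphi(\theta)\|^2 = s^2(\theta) \le K(1 + \|\theta\|^2), \qquad K = \text{const}, \]


where $\mathcal{F}_n$ is the $\sigma$-algebra generated by the available data, Duflo [26] used the
following Lyapunov function:


\[ W_n(\theta) = \|\theta_n - \theta^*\|^2. \tag{4.10} \]

From (4.8) and (4.10), we derive


\[ W_{n+1}(\theta) = \|\theta_n + \gamma_n Y_{n+1} - \theta^*\|^2 \tag{4.11} \]
\[ \phantom{W_{n+1}(\theta)} = W_n(\theta) + \gamma_n^2 \|Y_{n+1}\|^2 + 2\gamma_n (\theta_n - \theta^*)^T Y_{n+1}. \tag{4.12} \]


From the assumptions, we obtain the following estimate:


\[ s^2(\theta_n) \le K(1 + \|\theta_n\|^2) \le K'(1 + \|\theta_n - \theta^*\|^2) \quad \text{for some } K' < \infty. \]


Taking the conditional mathematical expectation in (4.12), we get


\[ \begin{aligned}
   E\{W_{n+1} \mid \mathcal{F}_n\}
     &\le W_n + \gamma_n^2 s^2(\theta_n) + 2\gamma_n (\theta_n - \theta^*)^T E\{Y_{n+1} \mid \mathcal{F}_n\} \\
     &\le W_n (1 + K' \gamma_n^2) + \gamma_n^2 K' - 2\gamma_n (\theta_n - \theta^*)^T \varphi(\theta_n).
   \end{aligned} \]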


The convergence can be easily derived by making use of the Robbins-Siegmund 
Theorem (see Appendix A). 


The algorithms related to constrained optimization problems and based on the use
of Lagrange functions correspond to the following min-max problem


\[ \inf_{\theta} \, \sup_{\lambda \in \mathbb{R}^m} L(\theta, \lambda), \qquad \dim \lambda = m. \]


A commonly used Lyapunov function [91] in order to establish the convergence of
the concerned optimization algorithms is


\[ W_n = \|\theta_n - \theta^*\|^2 + \|\lambda_n - \lambda^*\|^2, \]


where $\theta_n$, $\lambda_n$ and $(\theta^*, \lambda^*)$ represent respectively the argument, the Lagrange
multipliers vector and the saddle point.


When the regularized penalty function approach is used for solving a given constrained
optimization problem [91, 93], the following Lyapunov function can be
considered:


\[ W_n = \|\theta_n - \theta^*_{\mu,\delta}\|^2, \]


where


\[ \theta^*_{\mu,\delta} := \arg\min_{\theta} P_{\mu,\delta}(\theta) \]


and $P_{\mu,\delta}(\theta)$, $\mu$ and $\delta$ represent the regularized penalty function, the slack variables
(penalty multipliers) and the regularizing factor respectively. Similar Lyapunov
functions have been used in the framework of game theory [73, 94, 95].


Vandenberghe and Boyd [111] have developed a polynomial-time algorithm for
determining quadratic Lyapunov functions for nonlinear systems. For such systems,
quadratic Lyapunov functions can be determined using convex programming
techniques. Vandenberghe and Boyd describe an algorithm that either finds a quadratic
Lyapunov function, or terminates with a proof that no quadratic Lyapunov
function exists. The algorithm is an interior-point method based on the theory
developed by Nesterov and Nemirovsky [80].


In the proof of Theorems 1 and 2 in [79], the estimation error covariance matrix has
been considered as a Lyapunov function. The Lyapunov methods have also been
used to derive limit theorems and rates of convergence for processes with a renewal
process with common inter-renewal time distribution [48].


4.2.3. Robbins-Monro Approach


The fundamental approach of stochastic approximation techniques was initially 
developed by Robbins and Monro [99] in the 1950s for solving regression equations. 
It was extended by Kiefer and Wolfowitz [47] in 1952 for finding the extremum 
of regression equations. These studies, which are considered “classic,” constitute 
the foundations of stochastic approximation techniques. The previous studies were 
extended to the multivariable case by Blum [16]. Several techniques [46, 109] 
have been developed in order to accelerate the convergence of the approximation 
algorithms. Tsypkin [109] has shown that many problems related to pattern recog- 
nition, control, identification, filtering, etc., can be handled in a unified manner as 
learning problems by using stochastic approximation techniques. 


In order to illustrate the behavior of the stochastic approximation technique, let
us consider the problem related to the estimation of the mean $m_x$ of a given
random variable $X$ on the basis of $n$ independent and identically distributed observations
$(X_1, X_2, \ldots, X_n)$. By making use of the strong law of large numbers (see
Appendix A), we obtain the following estimate:


\[ m_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{4.13} \]


This estimator is consistent. In other words, the sequence of random variables
$Z_n = m_n$ converges to the mean value of $X$ with probability 1.


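Expression (4.13) can be written in recursive form as follows:


\[ \begin{aligned}
   m_n &= \frac{1}{n} \sum_{i=1}^{n} X_i \\
       &= \frac{1}{n} X_n + \frac{n-1}{n} \cdot \frac{1}{n-1} \sum_{i=1}^{n-1} X_i \\
       &= \frac{1}{n} X_n + \frac{n-1}{n} m_{n-1} \\
       &= m_{n-1} + \frac{1}{n} (X_n - m_{n-1}).
   \end{aligned} \tag{4.14} \]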


This expression shows that after $n$ trials (realizations), the previous estimate of
the mean is corrected by a term equal to the difference between the $n$th observation
and this estimate, divided by $n$.
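The recursion (4.14) can be sketched in a few lines. Only the current estimate and the sample counter need to be stored, not the past observations. The function name and the data are illustrative.

```python
def recursive_mean(observations):
    """Running mean via m_n = m_{n-1} + (X_n - m_{n-1}) / n, as in (4.14)."""
    m = 0.0  # the starting value is irrelevant: the first step sets m_1 = X_1
    for n, x in enumerate(observations, start=1):
        m += (x - m) / n
    return m

data = [2.0, 4.0, 9.0, 1.0, 4.0]  # hypothetical observations
m_recursive = recursive_mean(data)
m_batch = sum(data) / len(data)
```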
We shall now generalize the algorithm (4.14). Let us consider the random variable
$Y$ defined by


\[ Y = X - \theta, \]


where $\theta$ is an arbitrary scalar. Let us denote $E\{X\}$ by $m_x$. The mathematical
expectation of $Y$ conditioned on $\theta$ is


\[ E\{Y \mid \theta\} = E\{X - \theta \mid \theta\} = E\{X\} - \theta = m_x - \theta, \]


which is a function of $\theta$, i.e., a regression function. The estimation of the mean $m_x$
leads to the calculation of the root of the regression function


\[ \varphi(\theta) := E\{Y \mid \theta\} = m_x - \theta = 0. \tag{4.15} \]

As the estimate $m_n$ of the expectation $m_x$ is consistent, the sequence of random
variables $Z_n = m_n$ $(n = 1, 2, \ldots)$ converges in probability to the root $m_x$ of the
equation


\[ \varphi(\theta) = 0, \]


which is linear and depends only on one parameter. At each trial (sampling time $n$),
we get the realization


\[ Y_n = X_n - Z_{n-1} \]

and with (4.14) we deduce the observation (realization) of the random variable


\[ Z_n = Z_{n-1} + \frac{1}{n} Y_n = \left(1 - \frac{1}{n}\right) Z_{n-1} + \frac{1}{n} X_n. \]

Let us now consider the more general case that is related to the solution of the
following equation:


\[ \varphi(\theta) = a, \qquad a = \text{const}. \]

We assume that this equation has a unique solution which will be denoted by $\theta^*$.
We shall construct a stochastic approximation algorithm for solving this equation
as follows:


\[ \theta_n = \theta_{n-1} + \gamma_n (Y_n - a) \tag{4.16} \]


with


\[ \sum_{n=1}^{\infty} \gamma_n = \infty \tag{4.17} \]

and

\[ \sum_{n=1}^{\infty} \gamma_n^2 < \infty. \tag{4.18} \]


In many applications, the correction factor $\gamma_n$ is selected to be equal to $1/n$. The
meaning of these sufficient conditions for convergence is very simple: $\gamma_n$ has to
decrease in order to remove the influence of disturbances, but not so rapidly that a
point different from the optimal one is reached. When noise is not present, $\gamma_n$ can
be constant or a decreasing sequence that converges to a constant value.
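The algorithm (4.16) with $\gamma_n = 1/n$ can be sketched as follows. The example reuses the regression function $\varphi(\theta) = m_x - \theta$ of the mean estimation problem, so that the root of $\varphi(\theta) = a$ is $\theta^* = m_x - a$. The function names, the Gaussian noise and the numerical values are assumptions made for the illustration.

```python
import random

def robbins_monro(sample_y, a, theta0, n_steps, gain=lambda n: 1.0 / n):
    """Iterate theta_n = theta_{n-1} + gamma_n * (Y_n - a), as in (4.16)."""
    theta = theta0
    for n in range(1, n_steps + 1):
        y_n = sample_y(theta)  # noisy realization Y_n observed at theta_{n-1}
        theta = theta + gain(n) * (y_n - a)
    return theta

rng = random.Random(1)
m_x = 3.0  # hypothetical mean of X, unknown to the algorithm

def sample_y(theta):
    """Y = X - theta with X ~ N(m_x, 1); its conditional mean is m_x - theta."""
    return (m_x + rng.gauss(0.0, 1.0)) - theta

theta_hat = robbins_monro(sample_y, a=0.5, theta0=0.0, n_steps=20000)
```

The sign of the correction in (4.16) suits this example because $\varphi$ is decreasing in $\theta$. For an increasing regression function the correction would have to be subtracted instead.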


As previously mentioned, the proof of the convergence of the stochastic approx- 
imation algorithm given above can be carried out using the Lyapunov approach 
and martingale theory (see for instance Theorem 1.4.26, p. 29, in [26]). This proof 
shows the intimate connection between the Lyapunov approach, the martingale 
theory and stochastic approximation techniques. This connection is also shown in 
many other studies (see for example [59]). In [68], detailed comments related to the 
pseudogradient conditions for learning automata reinforcement schemes are given. 
The main theorems related to stochastic approximations techniques are given in 
[19, 20, 26, 27, 28, 35, 39, 51, 52, 57, 59, 84, 112]. 


The algorithms based on stochastic approximation techniques are very simple, 
need little memory, and have a reduced number of design parameters. Several 
stochastic approximation algorithms have been developed on the basis of Lagrange 
and penalty functions, in order to deal with constrained optimization problems 
[39, 51, 52, 59, 112]. The algorithm developed by Walk [112] has been used, e.g.,
in the design of a constrained long-range predictive controller using neural networks
[69], and for training logic processors under constraints [71]. 


The proofs presented in the paper authored by Gladyshev [35] can be easily fol- 
lowed by the reader. In this paper the author presents a very comprehensive proof 
of the Robbins-Monro stochastic approximation algorithm. He shows that the 
Robbins-Monro stochastic approximations can be applied to the least squares 
method. Indeed, he shows that the sequence $\{x_n\}$ converges with probability 1
to the value of the estimated parameter $\theta$, and that


\[ \sqrt{n} \, (x_n - \theta) \]


is asymptotically normal with zero mean and finite covariance matrix, defining 
in fact the corresponding convergence rate (results of asymptotic distribution of 
stochastic approximations are given in [102]). 


As in [26], this proof is based on martingale convergence arguments. The first step
in the proof consists of taking the conditional mathematical expectation, given the
$\sigma$-algebra generated by the previous estimates, of the estimate at stage $n$. This
algorithm is connected with the problem of finding the root of a regression equation.
Notice that Chong and Ramadge [19] follow the same line as in [35] for stating
the convergence of their recursive optimization algorithms using infinitesimal perturbation
analysis estimates (derivative with respect to the mean service time in
recursive optimization of queue systems). The idea of the proof is to examine the
asymptotic behavior of the weighted difference between the estimated argument at
time $n$ and the zero of the function to be minimized.


Pelletier [84] gives an overview of the methods used for the analysis of stochastic 
approximation algorithms: 


• martingale theory;
• central limit theorems;
• invariance principles;
• law of iterated logarithm;
• quadratic strong laws of large numbers.


She considers a quite general multidimensional algorithm that includes the
Robbins-Monro and Kiefer-Wolfowitz algorithms, as well as algorithms with small
Markovian disturbances:


\[ Z_{n+1} = Z_n + \gamma_n \left( h(Z_n) + r_{n+1} \right) + \sigma_n \varepsilon_{n+1}, \]


where $h(\cdot)$ is defined on $\mathbb{R}^d$ and is $\mathbb{R}^d$-valued, the perturbations $(r_n)$ and $(\varepsilon_n)$ are two sequences
of $d$-dimensional random vectors, and $\{\gamma_n\}$ and $\{\sigma_n\}$ are two nonrandom positive
sequences, fulfilling conditions (4.17) and (4.18). Pelletier states several
results. She establishes the law of the iterated logarithm, which gives the almost
sure convergence rate, and a quadratic strong law of large numbers.


Many authors argue that it is enough to state the properties of the noise, in order 
to use the stochastic approximation tools for stating the convergence of a given 
algorithm that resembles one of a stochastic approximation. From our point of 
view, we consider that, in general, to state the properties of the noise is not an 
easy task, and it is preferable to handle directly the analysis of the algorithm 
considered. In other words, the proofs of the main available tools for the analysis 
of stochastic approximation techniques are facilitated by the assumptions about the 
noise. However, the difficulty of the analysis is not avoided but it is only moved to
the assumptions. 


Remark 45 In [26], the Robbins-Monro convergence theorem is proved on the 
basis of martingale theory and the use of Robbins-Siegmund Theorem. 


4.2.4. The Ordinary Differential Equation 


So far, we have not said anything about the ordinary differential equation (ODE) 
method [13, 28, 57, 59]. The idea behind this method is very simple. It consists of 
associating an ordinary differential equation with the averaged recursive algorithm 
considered. Even if this idea is very simple, its application is not an easy task. This 
is our slant on this method.


This section establishes heuristically the link between the analysis (statement of
the asymptotic properties) of recursive stochastic algorithms and the study of the 
stability of a deterministic ordinary differential equation [26, 57]. 


Many recursive stochastic algorithms can be written in the following form:

$$c_n = c_{n-1} + \gamma_n Q(c_{n-1}, w_n), \qquad (4.19)$$

where, in general, $c_n \in \mathbb{R}^M$ is the estimated parameter vector and $w_n$ represents
the observation noise. In what follows, we shall, for simplicity of presentation, be
concerned only with the scalar case.

We shall be concerned with the determination of the root of a given function $f(x)$.
Let us rewrite (4.19) in the following form:

$$c_n = c_{n-1} + \gamma_n \bigl( f(c_{n-1}) + [Q(c_{n-1}, w_n) - f(c_{n-1})] \bigr).$$

Heuristically, the term

$$Q(c_{n-1}, w_n) - f(c_{n-1})$$

can be considered as the observation noise, while $Q(c_{n-1}, w_n)$ represents
a realization of the function $f(x)$ to be optimized. Observe that, in general,
due to the observation noise, the term $Q(c_{n-1}, w_n)$ is quickly time-varying, and
the sequence $\{c_n\}$ can be assumed to vary slowly if the correction factor $\gamma_n$ is relatively
small.

Taking into account the previous remarks, the algorithm under consideration can
now be reformulated as follows:

$$\bar{c}_n = \bar{c}_{n-1} + \gamma_n f(\bar{c}_{n-1}),$$

which represents an ordinary differential equation written in a discrete form.
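
The heuristic above can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the function `f(c) = 2 - c` (with root 2), the step size `gamma_n = 1/n`, the Gaussian noise and the names `noisy_recursion` and `averaged_recursion` are all hypothetical choices made for illustration and do not come from the text. The first routine runs the noisy recursion (4.19) with `Q(c, w) = f(c) + w`; the second runs the averaged (noise-free) recursion.

```python
import random


def f(c):
    # Hypothetical function whose root (c = 2) is sought.
    return 2.0 - c


def noisy_recursion(c0, steps, seed=0):
    # c_n = c_{n-1} + gamma_n * Q(c_{n-1}, w_n), with Q(c, w) = f(c) + w.
    rng = random.Random(seed)
    c = c0
    for n in range(1, steps + 1):
        gamma = 1.0 / n                # assumed step-size sequence
        w = rng.gauss(0.0, 1.0)        # assumed centered observation noise
        c = c + gamma * (f(c) + w)
    return c


def averaged_recursion(c0, steps):
    # Same recursion with the noise term removed (discretized ODE).
    c = c0
    for n in range(1, steps + 1):
        c = c + (1.0 / n) * f(c)
    return c


if __name__ == "__main__":
    print(noisy_recursion(0.0, 20000), averaged_recursion(0.0, 20000))
```

With these choices both recursions end close to the root of `f`, which is the behaviour the ODE heuristic suggests; other choices of `f` or of the step size would need their own justification.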


The next subsection presents a “methodology” in the form of “if-then-else” rules. 


4.2.5. Summary 
In summary, the analysis of a recursive algorithm can be done as follows: 


• Derive the vector form of the algorithm.
• If there are constraints, then:
   – Introduce a Lagrange function (possibly an augmented or regularized Lagrange
     function if it is not convex²) or a penalty function, start to deal with
     the existence of the saddle point, and construct a Lyapunov function on the
     basis of the saddle point.
• Else:
   – If a Lyapunov function exists, then:
      ∗ Use it in connection with the well-known inequalities, lemmas, theorems
        and martingale theory, to state the convergence.
   – Else:
      ∗ Can the algorithm be written in a form where the new estimate is equal
        to the previous one plus a correction? If yes, then:
         ◦ State the properties of the correction term.
         ◦ If the algorithm resembles that of a stochastic approximation, then
           use the results related to stochastic approximation techniques.
         ◦ Otherwise, use the general tools (well-known inequalities, lemmas
           and theorems, martingale theory) for convergence analysis.

² A function is said to be convex if and only if all its chords are located above or on its
graph.


The previous and the next sections represent the main skeletal structure of this 
chapter. In what follows we shall present several important tools (inequalities, 
lemmas and theorems) used in the analysis of stochastic recursive algorithms. The 
use of these tools will be illustrated with many simple examples. 

4.3. Use of Some Inequalities, Lemmas and Theorems 

As pointed out by many authors, infinite series play a non-negligible role in the 
statement of many theoretical results (proofs of lemmas and theorems). There exist 
good references on the subject of series [4, 38]. Let us start this section by presenting
an example related to an infinite series of positive numbers [104].


Lemma 6 Let $\{a_n, n \ge 1\}$ and $\{b_n, n \ge 1\}$ be positive sequences, such that $\{b_n\}$
is increasing, and

$$\sum_{n=1}^{\infty} a_n b_n < \infty.$$

For each $k \ge 1$, let $n_k$ be the smallest integer $n \ge 1$ such that $b_n \ge k$. Then

$$\sum_{k=1}^{\infty} \sum_{n=n_k}^{\infty} a_n < \infty.$$


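Proof Let us consider the following double sum:

$$\sum_{k=1}^{\infty} \sum_{n=n_k}^{\infty} a_n. \qquad (4.20)$$

In view of the assumption of this lemma, this sum is greater than or equal to zero. Let
us divide and multiply by $b_n$, which is positive. We then obtain

$$\sum_{k=1}^{\infty} \sum_{n=n_k}^{\infty} a_n = \sum_{k=1}^{\infty} \sum_{n=n_k}^{\infty} \frac{a_n b_n}{b_n}.$$

Since for $n \ge n_k$, $b_n \ge k$, we derive

$$\sum_{k=1}^{\infty} \sum_{n=n_k}^{\infty} a_n \le \sum_{k=1}^{\infty} \frac{1}{k} \sum_{n=n_k}^{\infty} a_n b_n.$$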


Taking into account that k > 1, it follows 


Since $\sum_{n=1}^{\infty} a_n b_n < \infty$, we obtain the claim of the lemma.


The next example corresponds to a lemma due to Feller [32] that is related to
normal random variables. The proof of this lemma is based on simple algebraic 
inequalities. 


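Lemma 7 ([32]; see also Lemma 5.1.1 in [104]) Let $X$ be an $N(0,1)$ random
variable. Then

$$\Pr(X > \varepsilon) \le \varepsilon^{-1} \exp\left(-\frac{\varepsilon^2}{2}\right), \quad \text{for } \varepsilon > 0. \qquad (4.21)$$

For any given $\gamma > 0$ there exists $\varepsilon(\gamma)$ such that $\varepsilon \ge \varepsilon(\gamma)$ implies

$$\Pr(X > \varepsilon) \ge \exp\left(-\frac{\varepsilon^2 (1 + \gamma)}{2}\right). \qquad (4.22)$$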


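Proof The probability density of the random variable $X$ is given by

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

and the following elementary inequalities hold:

$$(1 - 3x^{-4}) \exp\left(-\frac{x^2}{2}\right) < \exp\left(-\frac{x^2}{2}\right) < (1 + x^{-2}) \exp\left(-\frac{x^2}{2}\right). \qquad (4.23)$$

Let us fix the value of $\varepsilon$. Multiplying (4.23) by $1/\sqrt{2\pi}$ and integrating over the set
$(\varepsilon, \infty)$, we obtain the desired result (4.21), i.e.,

$$\frac{1}{\sqrt{2\pi}} \left(\varepsilon^{-1} - \varepsilon^{-3}\right) \exp\left(-\frac{\varepsilon^2}{2}\right) < \Pr(X > \varepsilon) < \frac{1}{\sqrt{2\pi}} \varepsilon^{-1} \exp\left(-\frac{\varepsilon^2}{2}\right).$$

If we choose $\varepsilon(\gamma)$ sufficiently large, we obtain the second result (4.22).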
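
The two-sided estimate used in this proof can be checked numerically. The sketch below is illustrative only: the function names `gaussian_tail` and `tail_bounds` are hypothetical, and the exact tail probability is obtained from the standard-library function `math.erfc` rather than from anything in the text.

```python
import math


def gaussian_tail(eps):
    # Pr(X > eps) for X ~ N(0, 1), via the complementary error function.
    return 0.5 * math.erfc(eps / math.sqrt(2.0))


def tail_bounds(eps):
    # Lower and upper bounds obtained by integrating (4.23) over (eps, infinity).
    base = math.exp(-0.5 * eps * eps) / math.sqrt(2.0 * math.pi)
    lower = (1.0 / eps - 1.0 / eps ** 3) * base
    upper = base / eps
    return lower, upper


if __name__ == "__main__":
    for eps in (0.5, 1.0, 2.0, 3.0, 5.0):
        lower, upper = tail_bounds(eps)
        print(eps, lower, gaussian_tail(eps), upper)
```

For small values of `eps` the lower bound is negative and therefore uninformative; it becomes tight only as `eps` grows, which is consistent with the role of $\varepsilon(\gamma)$ in (4.22).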


In what follows we shall present a lemma and its proof which involves the Bennett 
inequality [12, 34] (see Appendix A). The Bennett inequality is formulated as 
Corollary 1 of Theorem 4 in [34]. This lemma has been stated and proved by 
Devroye [23] in the framework of optimization on the basis of learning automata. 


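Lemma 8 (Lemma 1 in [23]) Let us consider a sequence of independent random
variables $Y_1, \ldots, Y_n$ with

$$Y_i = \begin{cases} 1 & \text{with probability } \alpha_i, \\ 0 & \text{otherwise,} \end{cases} \qquad i = 1, \ldots, n, \quad \alpha_i \in [0,1].$$

Then

$$\Pr\left(\sum_{i=1}^{n} Y_i < \frac{1}{2} \sum_{i=1}^{n} \alpha_i\right) \le \exp\left(-\frac{1}{10} \sum_{i=1}^{n} \alpha_i\right)$$

and

$$\Pr\left(\left|\sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} \alpha_i\right| > \frac{1}{2} \sum_{i=1}^{n} \alpha_i\right) \le 2 \exp\left(-\frac{1}{10} \sum_{i=1}^{n} \alpha_i\right).$$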


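Proof Let us consider the sum $\sigma^2$ of the variances of the random variables $Y_i$,
$i = 1, \ldots, n$, divided by $n$:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} E\{(Y_i - E\{Y_i\})^2\} = \frac{1}{n} \sum_{i=1}^{n} \alpha_i (1 - \alpha_i), \qquad \alpha_i \in [0,1].$$

This sum is bounded by

$$\sigma^2 \le \frac{1}{n} \sum_{i=1}^{n} \alpha_i.$$

Let us first observe that $(Y_i - \alpha_i)$ and $(\alpha_i - Y_i)$ have the same variance.
By making use of Bennett inequality (see Appendix A) for these centered variables,
whose absolute values are bounded by 1, we obtain

$$\Pr\left(\frac{1}{n} \sum_{i=1}^{n} (Y_i - \alpha_i) > \varepsilon\right) \le \exp\left[n\varepsilon - (n\varepsilon + n\sigma^2) \log\left(1 + \frac{\varepsilon}{\sigma^2}\right)\right]$$

$$\Pr\left(\frac{1}{n} \sum_{i=1}^{n} (\alpha_i - Y_i) > \varepsilon\right) \le \exp\left[n\varepsilon - (n\varepsilon + n\sigma^2) \log\left(1 + \frac{\varepsilon}{\sigma^2}\right)\right]. \qquad (4.24)$$

Taking into account the following inequality³:

$$\log(1 + x) > \frac{2x}{2 + x}, \quad \text{for } x > 0,$$

we derive

$$\Pr\left(\frac{1}{n} \sum_{i=1}^{n} (\alpha_i - Y_i) > \varepsilon\right) \le \exp\left[n\varepsilon - (n\varepsilon + n\sigma^2) \frac{2\varepsilon}{2\sigma^2 + \varepsilon}\right] = \exp\left(\frac{-n\varepsilon^2}{2\sigma^2 + \varepsilon}\right).$$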

³ The derivative of

$$f(x) = \log(1 + x) - \frac{2x}{2 + x}$$

is equal to

$$\frac{df(x)}{dx} = \frac{x^2}{(1 + x)(2 + x)^2} > 0 \quad \text{for } x > 0,$$

and $f(0) = 0$.
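Since

$$\Pr\left(\frac{1}{n} \sum_{i=1}^{n} (\alpha_i - Y_i) > \varepsilon\right) = \Pr\left(\frac{1}{n} \sum_{i=1}^{n} (Y_i - \alpha_i) < -\varepsilon\right),$$

we obtain

$$\Pr\left(\frac{1}{n} \sum_{i=1}^{n} (Y_i - \alpha_i) < -\varepsilon\right) \le \exp\left(\frac{-n\varepsilon^2}{2\sigma^2 + \varepsilon}\right). \qquad (4.25)$$

Based on the upper estimation of $\sigma^2$, and for

$$\varepsilon = \frac{1}{2n} \sum_{i=1}^{n} \alpha_i,$$

we have $\sigma^2 \le 2\varepsilon$, and therefore

$$\frac{n\varepsilon^2}{2\sigma^2 + \varepsilon} \ge \frac{n\varepsilon^2}{4\varepsilon + \varepsilon} = \frac{n\varepsilon}{5} = \frac{1}{10} \sum_{i=1}^{n} \alpha_i.$$

From (4.25), we derive

$$\Pr\left(\frac{1}{n} \sum_{i=1}^{n} (Y_i - \alpha_i) < -\varepsilon\right) = \Pr\left(\sum_{i=1}^{n} (Y_i - \alpha_i) < -n\varepsilon\right) = \Pr\left(\sum_{i=1}^{n} Y_i < \sum_{i=1}^{n} \alpha_i - n\varepsilon\right) = \Pr\left(\sum_{i=1}^{n} Y_i < \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \alpha_i\right),$$

that is,

$$\Pr\left(\sum_{i=1}^{n} Y_i < \frac{1}{2} \sum_{i=1}^{n} \alpha_i\right) \le \exp\left(-\frac{1}{10} \sum_{i=1}^{n} \alpha_i\right).$$

To prove the second assertion of this lemma, let us consider the term

$$\left|\sum_{i=1}^{n} (Y_i - \alpha_i)\right|.$$

We have

$$\Pr\left(\left|\sum_{i=1}^{n} (Y_i - \alpha_i)\right| > \frac{1}{2} \sum_{i=1}^{n} \alpha_i\right) = \Pr\left(\sum_{i=1}^{n} (Y_i - \alpha_i) > \frac{1}{2} \sum_{i=1}^{n} \alpha_i\right) + \Pr\left(\sum_{i=1}^{n} (Y_i - \alpha_i) < -\frac{1}{2} \sum_{i=1}^{n} \alpha_i\right)$$

$$= \Pr\left(\sum_{i=1}^{n} Y_i > \frac{3}{2} \sum_{i=1}^{n} \alpha_i\right) + \Pr\left(\sum_{i=1}^{n} Y_i < \frac{1}{2} \sum_{i=1}^{n} \alpha_i\right) \le 2 \exp\left(-\frac{1}{10} \sum_{i=1}^{n} \alpha_i\right),$$

where the first term is bounded in the same way as the second one, using the Bennett
inequality for the variables $(Y_i - \alpha_i)$.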
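
The bounds of this lemma can be compared with exact probabilities in a special case. The sketch below assumes, purely for illustration, that all the $\alpha_i$ are equal to a common value `alpha`, so that the sum of the $Y_i$ is binomial; the function names are hypothetical. The events are taken with non-strict inequalities, which can only make the check more demanding.

```python
from math import comb, exp


def binomial_pmf(n, p, k):
    # Probability that a Binomial(n, p) variable equals k.
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def lemma8_check(n, alpha):
    # Exact tail probabilities and the exponential bound for equal alpha_i.
    total = n * alpha  # sum of the alpha_i
    lower_tail = sum(binomial_pmf(n, alpha, k)
                     for k in range(n + 1) if k <= 0.5 * total)
    two_sided = sum(binomial_pmf(n, alpha, k)
                    for k in range(n + 1) if abs(k - total) >= 0.5 * total)
    bound = exp(-total / 10.0)
    return lower_tail, two_sided, bound


if __name__ == "__main__":
    for n, alpha in ((10, 0.5), (40, 0.3), (100, 0.05)):
        print(n, alpha, lemma8_check(n, alpha))
```

In these cases the exact probabilities lie well below the bounds, which suggests that the constant 1/10 is convenient rather than sharp.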
The next example shows that the existence of the variance of a given random
variable $x$ implies the existence of its first moment (mean).

Example 59 If the variance of a random variable $x$ exists (is finite), i.e.,

$$E\{|x|^2\} < \infty,$$

then its mean exists.

Solution For $x_1 := 1$ and $x_2 := x$, the Schwartz inequality (see Appendix A)
allows us to write

$$E^2\{|x|\} \le E\{|x|^2\} \cdot 1. \qquad (4.26)$$

Taking the square root, we get the desired result:

$$E\{|x|\} < \infty.$$


In the next example we will make use of the Chebyshev inequality (see Appendix A).
Substituting $(x - E\{x\})^2$ into the Markov inequality (see Appendix A), we obtain
the so-called Chebyshev inequality⁴, i.e.,

$$\Pr(|x - E\{x\}| \ge a) \le \frac{1}{a^2} \mathrm{Var}(x), \qquad a > 0.$$


Example 60 (Convergence in probability and in mean square) Let $\{X_n\}$ be
a sequence of random variables that converges in the mean square sense to $X$.
Show that $\{X_n\}$ converges also in probability to $X$.

Solution Let us use the Chebyshev inequality for $X_n - X$. We obtain

$$\Pr(|X_n - X| \ge \varepsilon) \le \frac{1}{\varepsilon^2} E\{|X_n - X|^2\}, \qquad \varepsilon > 0,$$

which implies the desired result if $E\{|X_n - X|^2\} \to 0$ as $n \to \infty$.


In the proof of Theorem 3 in [96], the Chebyshev inequality has been used to derive 
for any tiny y > 0, 


2 ^ 
; A n (A1) Var(6n) 
Pr( nj), — E(6n}| > y) < UE re) 


This inequality has also been used in the proof of the classical degenerate
convergence criterion [60]. In what follows, we shall present another example
which involves the Borel-Cantelli Lemma (see Appendix A) and the Chebyshev
inequality.

⁴ The Chebyshev inequality can be interpreted as follows: a small variance shows that
large deviations from the mean are less probable. This idea is behind the minimum variance
regulators and controllers in the framework of process control.


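Example 61 (This example constitutes the continuation of Example 22 in
Chapter 1) Consider a sequence of random variables $\{X_n\}_{n \ge 1}$ uniformly distributed
on the segment $[0, 1/n]$. State the almost sure convergence of $X_n$.

Solution In view of the Borel-Cantelli lemma (see [103], p. 254, Corollary 2), if
for some sequence $\varepsilon_n \downarrow 0$ (sequence decreasing to zero) we have

$$\sum_{n=1}^{\infty} \Pr(|X_n - X| \ge \varepsilon_n) < \infty, \qquad (4.27)$$

then $X_n \to X$ almost surely. For $X = 0$ and in view of Chebyshev's inequality

$$\Pr(|X_n| \ge \varepsilon_n) \le \frac{E\{|X_n|^2\}}{\varepsilon_n^2} \qquad (4.28)$$

and the following equality⁵, where $m_n := E\{X_n\}$:

$$E\{|X_n - m_n|^2\} = E\{X_n^2 - 2X_n m_n + m_n^2\} = E\{X_n^2\} - 2E\{X_n\} m_n + m_n^2 = E\{X_n^2\} - 2m_n^2 + m_n^2 = E\{X_n^2\} - m_n^2,$$

we derive

$$\sum_{n=1}^{\infty} \Pr(|X_n| \ge \varepsilon_n) \le \sum_{n=1}^{\infty} \varepsilon_n^{-2} E\{|X_n|^2\} = \sum_{n=1}^{\infty} \varepsilon_n^{-2} \left[ E\{|X_n - m_n|^2\} + m_n^2 \right].$$

Using the results derived in Example 22, we have

$$\sum_{n=1}^{\infty} \Pr(|X_n| \ge \varepsilon_n) \le \sum_{n=1}^{\infty} \varepsilon_n^{-2} \left[ \frac{1}{12n^2} + \frac{1}{4n^2} \right] = \frac{1}{3} \sum_{n=1}^{\infty} \frac{1}{\varepsilon_n^2 n^2}.$$

Let us take $\varepsilon_n = n^{-a}$ ($0 < a < \frac{1}{2}$). Then $1/(\varepsilon_n^2 n^2) = n^{-\beta}$ with
$\beta = 2 - 2a > 1$, so that the series converges⁶. This proves the result.

⁵ This equality can also be proven by calculating the following integral:

$$E\{|X_n - m_n|^2\} = \int_0^{1/n} n\, |x - m_n|^2 \, dx.$$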


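Lemma 9 (Bernoulli law of large numbers) (see [60]) Let us denote by $S_n$ the
number of occurrences of an outcome in $n$ repeated Bernoulli trials, each with
success probability $p$ (and $q = 1 - p$); then $\forall \varepsilon > 0$:

$$\Pr\left(\left|\frac{S_n}{n} - p\right| \ge \varepsilon\right) \to 0$$

as $n \to \infty$.

Proof Taking into account that the mathematical expectation $E\{S_n\}$ of $S_n$ is equal
to $np$, we derive

$$\Pr\left(\left|\frac{S_n}{n} - p\right| \ge \varepsilon\right) = \Pr(|S_n - np| \ge n\varepsilon) = \Pr(|S_n - E\{S_n\}| \ge n\varepsilon).$$

Upon applying the Chebyshev inequality (see Appendix A), we obtain

$$\Pr(|S_n - E\{S_n\}| \ge n\varepsilon) \le \frac{\mathrm{Var}(S_n)}{n^2 \varepsilon^2} = \frac{npq}{n^2 \varepsilon^2} = \frac{pq}{n \varepsilon^2} \to 0 \quad \text{as } n \to \infty.$$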


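Theorem 5 (Weak law of large numbers) (see [24, 25]) Let $X_1, X_2, \ldots, X_n$ be
independent random variables for which

$$E\{X_i\} = m \quad \text{and} \quad \mathrm{Var}(X_i) = \sigma^2, \qquad (i = 1, \ldots, n).$$

Let $\varepsilon > 0$ and $\delta > 0$ be arbitrary. Then there exists $N(\varepsilon, \delta)$ such that for all
$n > N(\varepsilon, \delta)$

$$\Pr\left(\left|\frac{S_n}{n} - m\right| \ge \varepsilon\right) < \delta$$

or, equivalently,

$$\Pr\left(\left|\frac{S_n}{n} - m\right| < \varepsilon\right) > 1 - \delta,$$

where $S_n$ represents the sum

$$S_n = \sum_{i=1}^{n} X_i.$$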


⁶ The series $\sum_{n=1}^{\infty} 1/n^{\alpha}$ is convergent for $\alpha > 1$, and divergent for $\alpha \le 1$.
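
Proof Let us calculate the mean of $S_n$ and the variance of $S_n/n$:

$$E\{S_n\} = E\left\{\sum_{i=1}^{n} X_i\right\} = \sum_{i=1}^{n} E\{X_i\} = nm$$

$$\mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.$$

By making use of the Chebyshev inequality, we derive

$$\Pr\left(\left|\frac{S_n}{n} - m\right| \ge \varepsilon\right) \le \frac{\mathrm{Var}(S_n/n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} < \delta.$$

To end the proof of this theorem, it is sufficient to select $N(\varepsilon, \delta)$ as the smallest
positive integer $n$ that fulfills the following inequality:

$$\frac{\sigma^2}{n\varepsilon^2} < \delta.$$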

The Bunyakovsky-Schwartz (Cauchy-Bunyakovsky) inequality has been used in
the proof of many statements (see for example the proof of Theorem 3, Chapter 3 in
[91]), including the existence of integrals related to orthonormal sets with respect
to a distribution function [53], i.e.,

$$\int g_i(x) g_j(x) \, dF(x) = \delta_{ij} \quad \text{(continuous case)}$$

$$\sum_{x} g_i(x) g_j(x) \Pr(x) = \delta_{ij} \quad \text{(discrete case)}, \qquad i, j = 0, 1, 2, \ldots$$

The statement of the Chebyshev inequality is given in Appendix A. We shall present
here another proof of this inequality. The proof is based on the use of the indicator
function [60]. Indeed,

$$E\{x^2\} = E\{x^2 I_{\{|x| \ge \varepsilon\}}\} + E\{x^2 I_{\{|x| < \varepsilon\}}\} \ge E\{x^2 I_{\{|x| \ge \varepsilon\}}\}.$$

Taking into account that $|x| \ge \varepsilon$ on the event considered, we obtain the following estimation:

$$E\{x^2\} \ge \varepsilon^2 E\{I_{\{|x| \ge \varepsilon\}}\} = \varepsilon^2 \Pr(|x| \ge \varepsilon).$$


The main developments related to random variables can be carried out on the
basis of complex random variables. In this case the characteristic functions, which
require random variables that take complex values, are commonly used in the
proof of limit theorems. In this framework, among others, the following inequalities
are used:


2 
lexp(ia) — 1 — ia| < T 
lexp(z) -z - 1| < lz, [zl 4. 


One application of these inequalities is given in the proof of Theorem 1 in [49], 
which states that, under some assumptions, the distribution functions of random 
sums weakly converge to a distribution function. The theorems proved in this paper 
are analogs to well-known limit theorems for sums of independent random variables 
with a nonrandom number of summands. 


The following inequalities:

$$\log x \le x - 1, \qquad x > 0$$

$$\exp(x) - 1 - x \le \frac{x^2}{2} \exp(|x|)$$

$$\max\left\{x^2 \exp(-hx),\ x \ge 0\right\} = \frac{4 \exp(-2)}{h^2}, \qquad h > 0$$

are useful for the statement of many theoretical results (see for instance the proof
of Theorem 1 in [55]).


The following inequality is useful when we deal with quadratic expressions
(norms):

$$\|a + b\|^2 \le (1 + \varepsilon) \|a\|^2 + \left(1 + \frac{1}{\varepsilon}\right) \|b\|^2, \qquad \varepsilon > 0. \qquad (4.29)$$
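
Inequality (4.29) follows from the fact that the difference between its two sides equals $\|\sqrt{\varepsilon}\, a - b/\sqrt{\varepsilon}\|^2 \ge 0$. The sketch below, with the hypothetical helper names `sq_norm` and `gap`, evaluates this difference for vectors given as Python lists.

```python
import random


def sq_norm(v):
    # Squared Euclidean norm of a vector given as a list of floats.
    return sum(x * x for x in v)


def gap(a, b, eps):
    """Right-hand side minus left-hand side of (4.29)."""
    lhs = sq_norm([x + y for x, y in zip(a, b)])
    rhs = (1.0 + eps) * sq_norm(a) + (1.0 + 1.0 / eps) * sq_norm(b)
    return rhs - lhs


if __name__ == "__main__":
    rng = random.Random(1)
    a = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    print(gap(a, b, 0.1), gap(a, b, 1.0), gap(a, b, 10.0))
```

The gap vanishes when $b = \varepsilon a$, which shows that the free parameter $\varepsilon$ can be tuned to make the bound tight for a given pair of vectors.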


Stochastic programming problems can a priori be solved using nonlinear
programming techniques. In this case it is necessary to evaluate the cost function,
and its derivatives if they exist. These evaluations are, in general, time-consuming.
There exist two main approaches to handling these problems: 1) approximation
techniques, 2) methods based on the quasi-gradient. If the function to be
optimized (more exactly, its mathematical expectation) is convex, then error bounds
based on the Jensen inequality and on the Edmundson-Madansky inequality (see
Example 15, Chapter 1) can be carried out and used instead of the function itself
(see for instance Chapters 1, 2 and 11 in [30]). There exist many applications of
the Jensen inequality. In the proof of Lemma 2.5 in [45], Johnson makes use of this
inequality.

Example 62 Consider the following expression (see the proof of Theorem 4 in
Chapter 4, pp. 124-126, in [91]):
2 


e" — Ne(u,) 
Pn — Ph + Yn [so — Pn Maur - (pii — Ph) 


Wii < 


+ [an — Aa + và (Fn Ts) — Gu - Aa) 

where 

* 1 m f i : 

pem tcu) , ča C (0,1], i —],...,m 
corresponds to the environment response, and $p_n$ and $\lambda_n$ represent respectively
the probability and the Lagrange multiplier vectors. $p_n^*$ and $\lambda_n^*$ denote the optimal
solution.
In view of (4.29), which is valid for any $\varepsilon = \varepsilon_n > 0$, the expression is of the form

$$W_{n+1} = \|a_1 + b_1\|^2 + \|a_2 + b_2\|^2.$$


It follows that 


— u 
Wrst € (1+ én) n) 


Pn — Ph + Yn [eun — Pr + pec D 


£055 [pta - Pal? els Me ash 
+A +e) azp —agl?. 


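Example 63 (Toeplitz Lemma) (see for instance [93]) Let $\{a_n\}$ be a sequence of
nonnegative variables such that for all $n \ge n_0$

$$0 < b_n := \sum_{t=1}^{n} a_t \to \infty \quad \text{as } n \to \infty$$

and $\{x_n\}$ is a sequence which converges to $x^*$, i.e.,

$$x_n \to x^* \quad \text{as } n \to \infty.$$

Then,

$$\frac{1}{b_n} \sum_{t=1}^{n} a_t x_t \to x^* \quad \text{as } n \to \infty.$$

Proof Since $x_n \to x^*$, for a given $\varepsilon > 0$ there exists an $n_0$ such that for all $n \ge n_0$

$$|x_n - x^*| \le \varepsilon.$$

We derive

$$\left|\frac{1}{b_n} \sum_{t=1}^{n} a_t x_t - x^*\right| \le \frac{1}{b_n} \sum_{t=1}^{n} a_t |x_t - x^*|$$

$$\le \frac{1}{b_n} \sum_{t=1}^{n_0 - 1} a_t |x_t - x^*| + \frac{\varepsilon}{b_n} \sum_{t=n_0}^{n} a_t$$

$$\le \frac{1}{b_n} \sum_{t=1}^{n_0 - 1} a_t |x_t - x^*| + \varepsilon.$$

The last inequality is valid for any $\varepsilon > 0$ and by making use of the fact that

$$b_n \to \infty \quad \text{as } n \to \infty,$$

we get the desired result.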

Remark 46 Directly from the previous lemma for

$$a_n = 1 \quad (n = 1, 2, \ldots)$$

it follows that if $x_n \to x^*$ as $n \to \infty$, then

$$\frac{1}{n} \sum_{t=1}^{n} x_t \to x^* \quad \text{as } n \to \infty.$$
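
The Toeplitz Lemma and its arithmetic-mean special case can be illustrated on explicit sequences. In the sketch below the weights `a_t = t` and the sequence `x_t = 1 + 1/t` (with limit 1) are hypothetical choices, and `weighted_average` is an illustrative helper name.

```python
def weighted_average(a, x):
    # (1 / b_n) * sum_t a_t * x_t, with b_n = sum_t a_t.
    b_n = sum(a)
    return sum(a_t * x_t for a_t, x_t in zip(a, x)) / b_n


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        a = [float(t) for t in range(1, n + 1)]
        x = [1.0 + 1.0 / t for t in range(1, n + 1)]
        # Weighted average (Toeplitz) and plain average (Remark 46).
        print(n, weighted_average(a, x), weighted_average([1.0] * n, x))
```

For these particular sequences the weighted average equals $1 + 2/(n+1)$ in exact arithmetic, so both averages approach the limit $x^* = 1$, although at different rates.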


Lemma 10 (Kronecker Lemma) Let $\{b_n\}$ be a sequence of nonnegative variables
such that

$$b_n \to \infty \ \text{as } n \to \infty, \qquad b_{n+1} \ge b_n \quad \forall n \ge 1$$

and $\{x_n\}$ be a sequence of variables such that the sum

$$\sum_{n=1}^{\infty} x_n$$

converges. Then

$$\frac{1}{b_n} \sum_{k=1}^{n} b_k x_k \to 0 \quad \text{as } n \to \infty. \qquad (4.30)$$

As with the Toeplitz Lemma, the Kronecker Lemma is commonly used to state
theoretical results in the field of stochastic processes (see for example the proof of
Theorem 5.5.4 in [37]). It corresponds to a particular case of the Toeplitz Lemma.


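Proof Let us consider the following sequences

$$a_k = b_k - b_{k-1}, \qquad b_0 = 0 \qquad (4.31)$$

$$S_{n+1} = \sum_{k=1}^{n} x_k.$$

Observe that

$$x_k = (S_{k+1} - S_k).$$

Substituting $(S_{k+1} - S_k)$ for $x_k$ into (4.30), we derive

$$\frac{1}{b_n} \sum_{k=1}^{n} b_k x_k = \frac{1}{b_n} \sum_{k=1}^{n} b_k (S_{k+1} - S_k).$$

By adding and subtracting⁷ the term $b_{k-1} S_k$ to the right-hand side of the previous
equality, we obtain

$$\frac{1}{b_n} \sum_{k=1}^{n} b_k x_k = \frac{1}{b_n} \sum_{k=1}^{n} \left[ b_k (S_{k+1} - S_k) - b_{k-1} S_k + b_{k-1} S_k \right]$$

$$= \frac{1}{b_n} \sum_{k=1}^{n} \left[ b_k S_{k+1} - S_k (b_k - b_{k-1}) - b_{k-1} S_k \right].$$

Taking into account (4.31), we obtain

$$\frac{1}{b_n} \sum_{k=1}^{n} b_k x_k = \frac{1}{b_n} \sum_{k=1}^{n} (b_k S_{k+1} - b_{k-1} S_k) - \frac{1}{b_n} \sum_{k=1}^{n} a_k S_k.$$

Let us now calculate the term

$$\frac{1}{b_n} \sum_{k=1}^{n} (b_k S_{k+1} - b_{k-1} S_k) = \frac{1}{b_n} \left[ (b_1 S_2 - b_0 S_1) + (b_2 S_3 - b_1 S_2) + \cdots + (b_n S_{n+1} - b_{n-1} S_n) \right] = S_{n+1}.$$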


⁷ Adding and subtracting the same term to a given expression is very simple, but in some
cases it represents a judicious tool for carrying out more convenient formulas.

Finally, we derive:

$$\frac{1}{b_n} \sum_{k=1}^{n} b_k x_k = S_{n+1} - \frac{1}{b_n} \sum_{k=1}^{n} a_k S_k.$$

In view of Toeplitz Lemma for $x_k = S_k$, and denoting by $S$ the limit of $S_n$, we get

$$\lim_{n \to \infty} \frac{1}{b_n} \sum_{k=1}^{n} b_k x_k = S - S = 0.$$
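
The summation-by-parts identity at the heart of this proof can be verified numerically. The sketch below is illustrative: the function name `kronecker_terms`, the choice `b_k = k` and the alternating sequence `x_k = (-1)^(k+1) / k` (whose series converges) are hypothetical and do not come from the text.

```python
def kronecker_terms(b, x):
    """Return both sides of the identity
    (1/b_n) sum b_k x_k = S_{n+1} - (1/b_n) sum a_k S_k."""
    n = len(x)
    # S[k] stores S_k of the text: S_1 = 0 and S_{k+1} = x_1 + ... + x_k.
    S = [0.0] * (n + 2)
    for k in range(1, n + 1):
        S[k + 1] = S[k] + x[k - 1]
    b_ext = [0.0] + list(b)  # prepend b_0 = 0 as in (4.31)
    lhs = sum(b_ext[k] * x[k - 1] for k in range(1, n + 1)) / b_ext[n]
    rhs = S[n + 1] - sum((b_ext[k] - b_ext[k - 1]) * S[k]
                         for k in range(1, n + 1)) / b_ext[n]
    return lhs, rhs


if __name__ == "__main__":
    n = 2000
    b = [float(k) for k in range(1, n + 1)]
    x = [(-1.0) ** (k + 1) / k for k in range(1, n + 1)]
    print(kronecker_terms(b, x))
```

Both returned values agree up to rounding and are close to zero for large `n`, as (4.30) predicts for this example.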


Example 64 Consider the following expression (see (3.24) in the proof of 
Theorem 3, Chapter 3 in [91]) 


= 1 
JE | = 1|Fn-1; Un = u(i) p Pali), 
Pn+1(@) 


where $u(\alpha)$ represents the optimal action of the automaton, and $p_{n+1}(\alpha)$ is its
associated probability.


Let us now add and subtract the following term:


1 
E {pn+1 (@)|Fn-1; un = u(i)) 


We have that 
= 1 

—————— -1 i) + Sn, 4.32 
D T T ) pati ^ d 
where 


= l 


l 
n=) E} ——— — 
à 2 -o E{ pn41(@)|Fn—13 un = u(i)) 


Fn-13Un = uc Pni(i). 
(4.33) 


There exist many different ways of proving the Kronecker Lemma (see for instance 
Lemma 2 of Chapter 5, pp. 114—115 in [21]). The Kronecker Lemma is commonly 
used for the proofs of many results (see for example Theorem 4.1, p. 27 in [59]). 


The next example concerns the Kolmogorov and Khinchin Theorem. 

Theorem 6 (Kolmogorov and Khinchin) Let $x_1, x_2, \ldots$ be independent random
variables with a finite mean. If

$$\sum_{n=1}^{\infty} \mathrm{Var}(x_n) < \infty,$$

then

$$\sum_{n=1}^{\infty} \left[ x_n - E\{x_n\} \right]$$

converges almost surely.

Proof We shall present a proof involving the Robbins-Siegmund Theorem. Let

$$s_n := \left( \sum_{t=1}^{n} [x_t - E\{x_t\}] \right)^2 = \left( \sum_{t=1}^{n-1} [x_t - E\{x_t\}] + [x_n - E\{x_n\}] \right)^2$$

$$= s_{n-1} + 2 [x_n - E\{x_n\}] \sum_{t=1}^{n-1} [x_t - E\{x_t\}] + [x_n - E\{x_n\}]^2.$$

Then in view of the independence of $\{x_n\}$, we derive

$$E\left\{ [x_n - E\{x_n\}] \sum_{t=1}^{n-1} [x_t - E\{x_t\}] \,\Big|\, \mathcal{F}_{n-1} \right\} = E\left\{ [x_n - E\{x_n\}] \,\big|\, \mathcal{F}_{n-1} \right\} \sum_{t=1}^{n-1} [x_t - E\{x_t\}] = 0$$

and hence

$$E\{s_n \mid \mathcal{F}_{n-1}\} = s_{n-1} + E\{[x_n - E\{x_n\}]^2\} = s_{n-1} + \mathrm{Var}(x_n).$$

Then, by making use of the Robbins-Siegmund Theorem, for

$$\beta_n = \mathrm{Var}(x_n), \qquad \alpha_n = \eta_n := 0$$

we obtain

$$E\{s_n \mid \mathcal{F}_{n-1}\} = s_{n-1} (1 + \alpha_n) - \eta_n + \beta_n, \qquad \sum_{t=1}^{\infty} (\alpha_t + \beta_t) < \infty.$$

Then $s_n$ converges a.s., and we obtain the desired result.

Another proof is presented in [103].
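
The statement of Theorem 6 can be illustrated by simulation. The sketch below makes several assumptions that are not in the text: the variables are taken as `x_n = s_n / n` with independent random signs `s_n` (so that the mean is 0 and the variance is `1/n**2`, whose sum is finite), the seed is fixed for reproducibility, and `partial_sums` is a hypothetical helper name.

```python
import random


def partial_sums(n_terms, seed=0):
    # Partial sums of x_n = (random sign) / n; Var(x_n) = 1 / n**2.
    rng = random.Random(seed)
    total, sums = 0.0, []
    for n in range(1, n_terms + 1):
        x_n = rng.choice((-1.0, 1.0)) / n
        total += x_n
        sums.append(total)
    return sums


if __name__ == "__main__":
    sums = partial_sums(20000)
    # Late partial sums barely move, as expected for a convergent series.
    print(sums[9999], sums[19999])
```

A single simulated path proves nothing, of course; it only shows the kind of behaviour the theorem guarantees with probability one.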


The next example concerns the Kolmogorov strong law of large numbers (see for 
instance [6]). 


Example 65 (Kolmogorov strong law of large numbers) Let $X_1, X_2, \ldots$ be
independent random variables with finite means and variances, and $\{b_n\}$ be an
increasing sequence of positive real numbers such that

$$b_n \to \infty.$$

If

$$\sum_{n=1}^{\infty} \frac{\mathrm{Var}(X_n)}{b_n^2} < \infty,$$

then

$$\frac{S_n - E\{S_n\}}{b_n} \xrightarrow{\text{a.s.}} 0,$$

where

$$S_n = X_1 + \cdots + X_n.$$

The proofs of the Kolmogorov law of large numbers presented in [97, 60] make use
of the Toeplitz Lemma. A theorem which is very useful for the characterization of
the strong law of large numbers for independent random variables in terms of the
magnitudes of the variance is given in Appendix A (see Theorem 5.2.2 in [104]). This
theorem corresponds to what is called the Kolmogorov exponential inequalities.


Proof Let us consider the following sum:

$$\sum_{n=1}^{\infty} \mathrm{Var}\left( \frac{X_n - E\{X_n\}}{b_n} \right),$$

which by assumption is finite. Indeed, in view of the assumption ($X_i$ has a finite variance,
$i = 1, \ldots$), we obtain

$$\sum_{n=1}^{\infty} \frac{\mathrm{Var}(X_n)}{b_n^2} < \infty.$$

In view of the Kolmogorov and Khinchin Theorem, we obtain the almost sure
convergence of the series

$$\sum_{k=1}^{\infty} \frac{X_k - E\{X_k\}}{b_k}.$$

Observe that

$$\frac{S_n - E\{S_n\}}{b_n} = \frac{1}{b_n} \sum_{k=1}^{n} b_k \, \frac{X_k - E\{X_k\}}{b_k};$$

then, in view of the Kronecker Lemma, we derive

$$\frac{S_n - E\{S_n\}}{b_n} \xrightarrow{\text{a.s.}} 0.$$

Example 66 Find an estimation for the following term:

$$\frac{\gamma_n^2 c}{(N - 1)^2 p_n(\alpha) (1 - \gamma_n)^2} \cdot \frac{1}{\left[ p_n(\alpha)(1 - \gamma_n) + \gamma_n c/(N - 1) \right]},$$

where $0 < \gamma_n < 1$, $c > 0$ and $p_n(\alpha) > 0$ represents a probability.
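Solution Let us consider the following expression:

$$(1 - \alpha) x + \alpha y,$$

where $0 < \alpha < 1$, $x > 0$ and $y > 0$. It follows that

$$(1 - \alpha) x + \alpha y \ge (1 - \alpha) y + \alpha y = y \quad \text{if } x \ge y$$

$$(1 - \alpha) x + \alpha y \ge (1 - \alpha) x + \alpha x = x \quad \text{if } y \ge x$$

from which we derive

$$(1 - \alpha) x + \alpha y \ge \min\{x, y\}.$$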


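In view of this expression, we derive for $\alpha := \gamma_n$, $x := p_n(\alpha)$ and $y := c/(N - 1)$

$$\left[ p_n(\alpha)(1 - \gamma_n) + \gamma_n \frac{c}{N - 1} \right] \ge \min\left\{ p_n(\alpha), \frac{c}{N - 1} \right\}$$

and, consequently,

$$\frac{\gamma_n^2 c}{(N - 1)^2 p_n(\alpha) (1 - \gamma_n)^2} \cdot \frac{1}{\left[ p_n(\alpha)(1 - \gamma_n) + \gamma_n c/(N - 1) \right]} \le \frac{\gamma_n^2 c}{(N - 1)^2 p_n(\alpha) (1 - \gamma_n)^2} \cdot \frac{1}{\min\{ p_n(\alpha), c/(N - 1) \}}.$$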


The Hajek-Renyi inequality (see Appendix A, [97]) was used by Gong et al. [36] 
in proving the statement of their lemma numbered 2.1. 
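
Lemma 11 (Lemma 2 in Liu et al. [56]) Let $\{x_n, n \ge 1\}$ be a sequence of
nonnegative random variables adapted to the sigma algebra $\mathcal{F}_n$, and such that
$x_n \le 1$ for all $n \ge 1$. Let $\bar{x}_i$ denote the mathematical conditional expectation
$E\{x_i \mid \mathcal{F}_{i-1}\}$ of $x_i$. Next, let

$$V_n(\lambda) = \frac{\lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} \left[ 1 + (\lambda - 1) \bar{x}_i \right]}, \qquad \forall n \ge 1, \quad \lambda > 0, \quad V_0(\lambda) = 1.$$

Then $\{V_n(\lambda), n \ge 0\}$ is a nonnegative supermartingale with respect to the $\sigma$-algebra
$\mathcal{F}_n$.

Proof Obviously, $\{V_n(\lambda), n \ge 0\}$ is nonnegative. By making use of the following
inequality⁸ (see Lemma 1 in Liu et al. [56])

$$x \ln \lambda \le \ln\left[ 1 + (\lambda - 1) x \right],$$

we derive

$$\lambda^{x_n} \le 1 + (\lambda - 1) x_n$$

and consequently

$$E\{\lambda^{x_n} \mid \mathcal{F}_{n-1}\} \le 1 + (\lambda - 1) \bar{x}_n,$$

which leads to

$$E\{V_n(\lambda) \mid \mathcal{F}_{n-1}\} = V_{n-1}(\lambda) \frac{E\{\lambda^{x_n} \mid \mathcal{F}_{n-1}\}}{1 + (\lambda - 1) \bar{x}_n} \le V_{n-1}(\lambda).$$

In other words, $\{V_n(\lambda), n \ge 0\}$ is a nonnegative supermartingale with respect to the
$\sigma$-algebra $\mathcal{F}_n$.

⁸ Let us consider the following function:

$$f(\lambda) := \ln\left[ 1 + (\lambda - 1) x \right] - x \ln \lambda, \qquad 0 < x < 1, \quad 0 < \lambda < \infty,$$

then

$$\frac{df(\lambda)}{d\lambda} = \frac{(\lambda - 1)(1 - x) x}{\lambda \left[ 1 + (\lambda - 1) x \right]}.$$

It follows that

$$\frac{df(\lambda)}{d\lambda} \ge 0 \ \text{for } \lambda \ge 1 \quad \text{and} \quad \frac{df(\lambda)}{d\lambda} \le 0 \ \text{for } 0 < \lambda \le 1.$$

Observe that $f(1) = 0$, and we derive

$$f(\lambda) \ge 0, \quad \text{for } \lambda > 0.$$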
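
The elementary inequality on which this proof rests, $\lambda^x \le 1 + (\lambda - 1)x$ for $0 \le x \le 1$ and $\lambda > 0$, is easy to check on a grid. The helper name `inequality_gap` and the grid values are hypothetical choices made for this sketch.

```python
def inequality_gap(lam, x):
    # 1 + (lam - 1) * x - lam**x, which should be nonnegative on [0, 1].
    return 1.0 + (lam - 1.0) * x - lam ** x


if __name__ == "__main__":
    for lam in (0.1, 0.5, 1.0, 2.0, 10.0):
        gaps = [inequality_gap(lam, i / 10.0) for i in range(11)]
        print(lam, min(gaps))
```

The gap vanishes at `x = 0` and `x = 1`, which reflects the fact that the right-hand side is the chord of the convex function `lam**x` between these two points.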


The Lipschitz continuity is used in order to state many theoretical results. 


Definition 39 (Lipschitz continuity) A function $f(x)$ is said to be Lipschitz continuous
on the segment $[a,b]$ if there exists a nonnegative constant $K$ such that for
any $x_1$ and $x_2 \in [a,b]$

$$|f(x_2) - f(x_1)| \le K |x_2 - x_1|. \qquad (4.35)$$

The constant $K$ is called a Lipschitz constant for $f(x)$ in the interval $[a, b]$.


Inequality (4.35) can be used to get the accuracy level e related to the quantization 
number N when an optimization problem on a continuous set is transformed into 
an optimization problem on a discrete set (see Chapter 3, Lemma 1 in [91]). There 
exist many results invoking this inequality (see for example the first assumption of 
Theorem 1.2 of Section 1, pp. 2-5 in [59]). 


The next example corresponds to a simple proof of a lemma due to Neveu [82]. The 
proof presented in what follows is based on the direct use of Robbins-Siegmund 
Theorem. 


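Lemma 12 (Neveu [82]) Let $\{T_n\}$, $\{\alpha_n\}$ and $\{\beta_n\}$ be sequences of nonnegative
random variables adapted to an increasing sequence of $\sigma$-algebras $\{\mathcal{F}_n\}$ such that

$$E\{T_{n+1} \mid \mathcal{F}_n\} \le T_n - \alpha_n + \beta_n.$$

If

$$\sum_{n=1}^{\infty} \beta_n \overset{\text{a.s.}}{<} \infty,$$

then $T_n$ converges almost surely to a finite random variable $T$, and

$$\sum_{n=1}^{\infty} \alpha_n \overset{\text{a.s.}}{<} \infty.$$

Proof The statements of this lemma follow directly by making use of the
Robbins-Siegmund Theorem (see Appendix A) applied with $W_n := T_n$, the same
sequence $\beta_n$, a zero multiplicative term, and $\eta_n := \alpha_n$ (the sequence $\alpha_n$ of this lemma).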


As an example of the use of the Markov inequality (see Appendix A), let us
mention the study carried out by Tsai [107]. This study presents the conditions that
are sufficient and nearly necessary for the compact and bounded law of the iterated
logarithm for Markov chains with a countable state space.


As an exercise, the reader can redo the calculations done in [98]. In fact, the ideas
of the proofs of the lemmas and theorems presented in [98] are similar to the simple
examples presented in this section. The authors carry out very simple calculations
and they appeal to the Cauchy-Schwartz inequality and to a well-known martingale
limit theorem [104].


In this section we have presented most of the tools which will be used to harvest 
the results in the later sections. The next section presents an optimization algorithm 
of the stochastic approximation type and its detailed analysis on the basis of the 
Robbins-Monro Theorem. 


4.4. Case 1: Single Learning Automaton 


In this section, we present an improved version of the reinforcement scheme 
algorithm used as a multimodal searching technique and developed by McMurtry 
and Fu [61]. This algorithm uses an averaging procedure and operates in an 
environment with continuous responses (S-model environment). We shall be only 
concerned with the single-dimensional case. 


In order to prevent degenerate situations where the realization of the function to be
optimized is equal to zero, we introduce an auxiliary strictly positive regularizing
parameter $\delta$. The resulting $\delta$-regularized function is bounded.


The probabilities associated with the actions of the automaton are directly calculated
from the realizations of the function to be optimized. In general, the realizations of 
the function to be optimized can be negative. As a consequence, the algorithm does 
not ensure that the probability measure is preserved. In order to avoid this problem, 
we introduce a projection procedure [68, 91] and a regularizing parameter. 


We carry out a vector formulation for this algorithm, which is more suitable for 
convergence analysis. Based on the Robbins-Monro techniques, we derive the 
asymptotic properties of this optimization algorithm. In order to apply the results in 
optimization, two corollaries are stated for function minimization and maximization 
purposes [76]. 


⁹ This section is based on K. Najim, P. Del Moral and E. Ikonen, An improved version
of the McMurtry-Fu reinforcement learning scheme, International Journal of Systems
Sciences, vol. 34, pp. 37-47, 2004. Reproduced with permission from Taylor & Francis Ltd.,
http://www.tandf.co.uk

4.4.1. Learning Automata


In Section 4.2, we briefly presented the McMurtry-Fu multimodal searching
algorithm. An extended version of this algorithm is presented and analysed here.
The main problem treated in this section may now be formulated.


Optimization Algorithm 

Let $f(x)$ be the real-valued function to be optimized. Only samples of the disturbed
values of $f(x)$ at various settings of $x$ can be observed. The noisy observation $y_n$
for a given action $x_n$ is defined by

$$y_n = f(x_n) + w_n. \qquad (4.36)$$

The observation noise sequence $\{w_n\}$ consists of independent and identically
distributed (iid) random variables with a common distribution law

$$P(w_n \in du) = \mu(du).$$

In addition, the observation noise is assumed to be centered, i.e.,

$$\int u \, \mu(du) = 0.$$

The argument $x$ is assumed to belong to the segment $[x_{\min}, x_{\max}]$, which is
discretized into $N$ intervals. Observe that this assumption is not restrictive. Let
$u_n \in \{u(1), u(2), \ldots, u(N)\}$ be the action selected by the automaton at time $n$,
which corresponds to the argument $x_n \in \{x(1), x(2), \ldots, x(N)\}$ selected at time $n$.
We consider that: 


(i) x(i) corresponds to the center of the interval considered; 

(ii) there is a unique correspondence between the actions u(i) and the arguments 
x (i), i.e., u(i) = x(i). In view of Lemma 5, Chapter 3, it follows that an 
increase of the quantization number N leads to an increase in the accuracy. 


Let us consider the following averaging technique [61]

$$Z_{n+1}(i) = \alpha Z_n(i) + (1 - \alpha) z_n(i), \qquad 0 < \alpha < 1 \qquad (4.37)$$

$$Z_{n+1}(j) = Z_n(j), \qquad j \ne i, \qquad (4.38)$$

where $\alpha$ represents a forgetting factor, and $z_n(i)$ corresponds to the realization of
the following cost function:

$$z_n(i) = \phi(y_n(i)) = \phi(f(x(i)) + w_n), \qquad (4.39)$$

where $\phi(\cdot)$ is a strictly monotonic bounded mapping $\mathbb{R} \to \mathbb{R}_+$. Notice that (4.39)
can represent any performance index (performance measure, criterion, function,
etc.) to be optimized.


Remark 47 In the McMurtry and Fu algorithm [61], at time $n$ the function $\phi(y_n)$
has the form

$$\left( \frac{H}{y_n \vee \delta} \right)^{\beta}, \qquad \beta = \text{const} > 0, \quad H = \text{const} > 0 \qquad (4.40)$$

with $\delta = 0$, where $a \vee b$ stands for the maximum of $a$ and $b$. Note that in (4.40)
an auxiliary strictly positive parameter $\delta > 0$ prevents degenerate situations when
the realization $y_n = 0$. By direct inspection, we observe that this $\delta$-regularized
function is bounded.


We can now reduce the optimization problem to that of finding the best strategy
(a sequence of control actions $\{u_n\}$). The automaton operates as follows: at each
time $n$, based on the probability distribution $p_n$, an action $u_n = u(i)$ is randomly
selected. The realization of the function to be optimized on the corresponding
argument $x_n := x(i)$, where $u_n = u(i)$, is given by $y_n$. The averaging procedure
(4.37)-(4.38) will be used as a reinforcement scheme, i.e., to derive the probability
distribution of the automaton actions. The probabilities are assigned as follows:


$$p_{n+1}(i) = \frac{Z_{n+1}(i)}{\sum_{j=1}^{N} Z_{n+1}(j)}, \qquad i = 1, \ldots, N. \qquad (4.41)$$

Note that the difference equations (4.37)-(4.41) are nonlinear and coupled.
The bulk of the remainder of this section is devoted to the introduction of a projection
procedure, a vector form of this optimization algorithm, and to the analysis of its
asymptotic properties.

4.4.2. Projection Procedure 


As might be expected from the construction of the optimization algorithm, we have

$$\sum_{i=1}^{N} p_{n+1}(i) = 1.$$

Indeed, from (4.41) we obtain

$$\sum_{i=1}^{N} p_{n+1}(i) = \sum_{i=1}^{N} \frac{Z_{n+1}(i)}{\sum_{j=1}^{N} Z_{n+1}(j)} = 1.$$

The function to be optimized is, in general, not positive. As a consequence, the
condition on the positivity of the probabilities $p_{n+1}(i)$ is not guaranteed for all $i$.
To ensure the probability measure, a projection operator $\Pi$ [68] onto the simplex

$$S_{\eta} := \left\{ p : \sum_{i=1}^{N} p(i) = 1, \ p(i) \ge \eta \ge 0, \ i = 1, \ldots, N \right\} \qquad (4.42)$$

will be used (see Subsection 4.2.1, Chapter 3). Projections of this type are commonly
used in optimization problems (identification, control, etc.) where the
parameters to be estimated have to belong to some specific domain (stability region,
etc.) in order to ensure some desired behavior (stability, etc.) [19, 52, 57]. The
projection operator $\Pi$ onto the simplex $S$ ($S_{\eta}$ with $\eta = 0$) is defined as

$$\Pi(z) = \begin{cases} z & \text{if } z \in S \\ z^* & \text{if } z \notin S, \quad \|z - z^*\| = \min_{q \in S} \|z - q\|. \end{cases} \qquad (4.43)$$

Let us mention that this property is used in the analysis of recursive stochastic
algorithms involving projection procedures. Indeed, it permits the derivation of
upper bounds [31].
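
For $\eta = 0$ and the Euclidean norm, the projection (4.43) can be computed by a well-known sort-and-threshold procedure. The sketch below uses that standard procedure, which is not necessarily the one described in [68]; the function name `project_to_simplex` is hypothetical.

```python
def project_to_simplex(z):
    """Euclidean projection of z onto {q : sum(q) = 1, q_i >= 0}."""
    u = sorted(z, reverse=True)
    cumulative, tau = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0.0:
            # Keep the threshold of the last index that passes the test.
            tau = candidate
    # Shift every component by tau and clip negative values to zero.
    return [max(z_i - tau, 0.0) for z_i in z]


if __name__ == "__main__":
    print(project_to_simplex([0.2, 0.3, 0.5]))   # already in the simplex
    print(project_to_simplex([-0.5, 0.5, 1.2]))  # negative component removed
```

A point that already belongs to the simplex is left unchanged, in agreement with the first case of (4.43); otherwise the result is nonnegative and sums to one. For $\eta > 0$ the procedure would have to be adapted.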


Following the lines of the methodology presented in Section 4.2, the next subsection
presents a vector formulation of this optimization algorithm.


4.4.3. Stochastic Learning Model 


For the theoretical considerations to come, it is more convenient to use another 
representation of the algorithm described above. We will now focus on the 
development of a vector form of the previous learning algorithm [42]. 


Let us introduce the following notations: 


$$Z_{n+1} = \begin{pmatrix} Z_{n+1}(1) \\ Z_{n+1}(2) \\ \vdots \\ Z_{n+1}(N) \end{pmatrix}, \qquad y_n(i) = \phi_n \big|_{u_n = u(i)}, \tag{4.44}$$


then the averaging procedure (4.37) and (4.38) can be written in a vector form as 
follows: 


$$Z_{n+1} = \alpha Z_n + (1 - \alpha) Y_n, \tag{4.45}$$
where

$$Y_n = \bigl( Z_n(1), \ldots, Z_n(i-1),\; y_n(i),\; Z_n(i+1), \ldots, Z_n(N) \bigr)^{T}.$$

The probability associated with the choice of $u_n = u(i)$ is given by

$$p_n^{\varepsilon}(i) = \frac{Z_n(i) + \varepsilon}{\sum_{k=1}^{N} \bigl( Z_n(k) + \varepsilon \bigr)},$$


where $\varepsilon > 0$ is a fixed small value, a regularizing parameter. In many problems the
difficulty is not with the existence but with the uniqueness of the solution. We will
see later that the introduction of the regularizing parameter $\varepsilon$ ensures uniqueness
[42]. For homogeneity of notation, the terms $Z_n(i) + \varepsilon$ ($i = 1, \ldots, N$) and
$Z_n + \varepsilon$ will be denoted by $Z_n^{\varepsilon}(i)$ and $Z_n^{\varepsilon}$, respectively.
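With each function

$$g : \{1, 2, \ldots, N\} \to \mathbb{R}$$

we associate a real-valued random variable $Y_n(g)$ defined by

$$Y_n(g) = \sum_{i=1}^{N} Y_n(i)\, g(i) = \bigl( g(1), \ldots, g(N) \bigr) \begin{pmatrix} Y_n(1) \\ \vdots \\ Y_n(N) \end{pmatrix},$$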


where ¢ represents the indicator function. 
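Note that for each $i = 1, \ldots, N$,

$$E\{Y_n(i) \mid Z_n\} = \left[ \int \phi\bigl(f(x(i)) + w\bigr)\, \mu(dw) \right] p_n^{\varepsilon}(i) + \sum_{j=1,\, j \neq i}^{N} p_n^{\varepsilon}(j)\, Z_n(i)$$

$$= V(i)\, p_n^{\varepsilon}(i) + \bigl(1 - p_n^{\varepsilon}(i)\bigr) Z_n(i),$$

where

$$V(i) = \int \phi\bigl(f(x(i)) + w\bigr)\, \mu(dw). \tag{4.46}$$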
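It follows that

$$E\{Y_n(g) \mid Z_n\} = \sum_{i=1}^{N} E\{Y_n(i) \mid Z_n\}\, g(i) = \sum_{i=1}^{N} \bigl[ V(i)\, p_n^{\varepsilon}(i) + \bigl(1 - p_n^{\varepsilon}(i)\bigr) Z_n(i) \bigr] g(i). \tag{4.47}$$

For $g = \mathbf{1}_{i_0}$, we have $Y_n(g) = Y_n(i_0)$, and the conditional mathematical
expectation (4.47) yields

$$E\{Y_n(g) \mid Z_n\} = E\{Y_n(i_0) \mid Z_n\} = V(i_0)\, p_n^{\varepsilon}(i_0) + \sum_{i=1,\, i \neq i_0}^{N} Z_n(i_0)\, p_n^{\varepsilon}(i)$$

$$= V(i_0)\, p_n^{\varepsilon}(i_0) + Z_n(i_0) \bigl(1 - p_n^{\varepsilon}(i_0)\bigr). \tag{4.48}$$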
Let us rewrite (4.45) in a more suitable form:

$$Z_{n+1} - Z_n = \gamma\, U(Z_n, Y_n), \tag{4.49}$$

where $\gamma = (1 - \alpha)$ and $U(\cdot, \cdot)$ is the mapping from $\mathbb{R}^N$ onto itself, defined by

$$U(Z_n, Y_n) = Y_n - Z_n. \tag{4.50}$$

The parameter $\gamma$ determines the step size of the algorithm.


It is convenient at this point to notice that 


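$$E\{U(Z_n, Y_n) \mid Z_n\} = \begin{pmatrix} E\{Y_n(1) - Z_n(1) \mid Z_n\} \\ \vdots \\ E\{Y_n(N) - Z_n(N) \mid Z_n\} \end{pmatrix} = \begin{pmatrix} V(1)\, p_n^{\varepsilon}(1) + Z_n(1)\bigl(1 - p_n^{\varepsilon}(1)\bigr) - Z_n(1) \\ \vdots \\ V(N)\, p_n^{\varepsilon}(N) + Z_n(N)\bigl(1 - p_n^{\varepsilon}(N)\bigr) - Z_n(N) \end{pmatrix}$$

$$= \begin{pmatrix} \bigl(V(1) - Z_n(1)\bigr) p_n^{\varepsilon}(1) \\ \vdots \\ \bigl(V(N) - Z_n(N)\bigr) p_n^{\varepsilon}(N) \end{pmatrix} = U^{\varepsilon}(Z_n), \tag{4.51}$$

with the function $U^{\varepsilon} : \mathbb{R}^N \to \mathbb{R}^N$ defined by

$$U^{\varepsilon}(z) = \begin{pmatrix} \bigl(V(1) - z(1)\bigr) \dfrac{z(1) + \varepsilon}{\sum_{k=1}^{N} z(k) + N\varepsilon} \\ \vdots \\ \bigl(V(N) - z(N)\bigr) \dfrac{z(N) + \varepsilon}{\sum_{k=1}^{N} z(k) + N\varepsilon} \end{pmatrix}. \tag{4.52}$$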


The analysis of this optimization algorithm is carried out in the next subsection. 


4.4.4. Asymptotic Properties 


Our first order of business in this subsection is to establish the asymptotic
properties of the optimization algorithm considered. Notice that stochastic approximation
techniques constitute the framework of many self-learning algorithms [26, 99]. They
play a very important role in many fields such as adaptive control, estimation,
data communication, etc. It should be noted also that these techniques are
extremely noise resistant. Asymptotically, random independent additive noise is
eliminated and does not affect the results, i.e., the optimal solution. The formulation
of the optimization algorithm in the form (4.49) and (4.50) allows us to
apply the Robbins-Monro Theorem. We are now ready for our main
result [42].


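Theorem 7 Consider the stochastic learning algorithm (4.49) and (4.50) associated
with the cost function $\phi(\cdot)$ (4.39), with an inhomogeneous decreasing time
step (correction factor) $\gamma_n$. Suppose

$$\gamma_n = 1 - \alpha_n, \qquad \lim_{n \to \infty} \gamma_n = 0, \qquad \sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \text{and} \qquad \sum_{n=1}^{\infty} \gamma_n^{2} < \infty. \tag{4.53}$$

Then for each $\varepsilon > 0$, as the time parameter tends to infinity, we have that

$$Z_n \xrightarrow[n \to \infty]{\text{a.s.}} V,$$

where

$$V = \begin{pmatrix} V(1) \\ \vdots \\ V(N) \end{pmatrix} \qquad \text{and} \qquad V(i) = \int \phi\bigl(f(x(i)) + w\bigr)\, \mu(dw). \tag{4.54}$$

In addition, for all $i = 1, 2, \ldots, N$,

$$\lim_{n \to \infty} p_n^{\varepsilon}(i) = p_{\infty}^{\varepsilon}(i) = \frac{V(i) + \varepsilon}{\sum_{k=1}^{N} \bigl(V(k) + \varepsilon\bigr)} \qquad (P\text{-almost surely}).$$

Proof The learning algorithm (4.49) and (4.50) can be regarded as a Robbins-Monro
type stochastic algorithm with decreasing time step $\gamma_n$. The convergence
properties of these algorithms depend on the nature of the potential function $U^{\varepsilon}(\cdot)$.
In our context, we notice from (4.52) that the function $U^{\varepsilon}(\cdot)$ is such that

$$U^{\varepsilon}(z) = 0 \iff z = V.$$

In addition, for each $z \in \mathbb{R}^N$, $z \neq V$, we have

$$\langle z - V,\; U^{\varepsilon}(z) - U^{\varepsilon}(V) \rangle < 0.$$

To see this claim, we observe that

$$\langle z - V,\; U^{\varepsilon}(z) - U^{\varepsilon}(V) \rangle = \langle z - V,\; U^{\varepsilon}(z) \rangle = \sum_{i=1}^{N} \bigl(z(i) - V(i)\bigr)\bigl(V(i) - z(i)\bigr)\, p^{\varepsilon}(i)$$

with

$$p^{\varepsilon}(i) = \frac{z(i) + \varepsilon}{\sum_{k=1}^{N} z(k) + N\varepsilon},$$

from which we conclude that

$$\langle z - V,\; U^{\varepsilon}(z) - U^{\varepsilon}(V) \rangle = -\sum_{i=1}^{N} \bigl(z(i) - V(i)\bigr)^{2} p^{\varepsilon}(i) < 0,$$

where $\langle \cdot, \cdot \rangle$ represents the Euclidean scalar product.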


These two properties of the potential function $U^{\varepsilon}(\cdot)$, as well as the choice
of a decreasing time step $\gamma_n$ of the form (4.53), allow us to use the traditional
convergence results on Robbins-Monro type stochastic algorithms (see for
instance Theorem 1.4.26 in [26]).


From these asymptotic results, it follows that 


$$\lim_{n \to \infty} Z_n = V \qquad (P\text{-almost surely}).$$

This almost sure convergence result readily implies that for each $i$ ($i = 1, \ldots, N$),

$$\lim_{n \to \infty} p_n^{\varepsilon}(i) = p_{\infty}^{\varepsilon}(i) = \frac{V(i) + \varepsilon}{\sum_{k=1}^{N} \bigl(V(k) + \varepsilon\bigr)} \qquad (P\text{-almost surely}).$$


The limiting distribution exhibited in this theorem gives more probability mass to 
the label i of the action u(i) that maximizes the function V (-). 


One of the attractions of this result is that it provides a general method of handling 
many optimization problems. In view of the preceding asymptotic properties, we are 
now in a position to examine separately the global maximization and minimization 
problems related to multimodal function optimization. From previous observations 
we propose two natural strategies for choosing the cost function. The following 
corollaries state the previous comments [42]. The first corollary deals with global 
minimization. 


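Corollary 1 (Global minimization) We use the same assumptions as in
Theorem 7. If we choose the $\delta$-regularized cost function

$$\phi^{\delta}(y) = \left( \frac{H}{y + \delta} \right)^{\beta}, \qquad \beta = \text{const} > 0; \quad H = \text{const} > 0,$$

then we obtain

$$\lim_{n \to \infty} p_n^{\varepsilon,\delta}(i) = p_{\infty}^{\varepsilon,\delta}(i) = \frac{V(i) + \varepsilon}{\sum_{k=1}^{N} \bigl(V(k) + \varepsilon\bigr)} \qquad (P\text{-almost surely})$$

with

$$V(i) = \int \left( \frac{H}{f(x(i)) + w + \delta} \right)^{\beta} \mu(dw).$$

In particular, we have that

$$\arg\min_{i = 1, \ldots, N} f(x(i)) = \arg\max_{i = 1, \ldots, N} p_{\infty}^{\varepsilon,\delta}(i)$$

in the sense $\{i : f(x(i)) = \min_j f(x(j))\} = \{i : p_{\infty}^{\varepsilon,\delta}(i) = \max_j p_{\infty}^{\varepsilon,\delta}(j)\}$.

The next corollary is dedicated to global maximization.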


Corollary 2 (Global maximization) We use the same assumptions as in
Theorem 7. In addition, we assume that for all $i$, $f(x(i)) \ge -\varepsilon$. If we
choose the identity cost function

$$\phi(y) = y,$$

then we obtain

$$V(i) = \int \bigl[f(x(i)) + w\bigr]\, \mu(dw) = f(x(i)).$$

We conclude that for any nonnegative function $f(\cdot)$,

$$\lim_{n \to \infty} p_n^{\varepsilon}(i) = p_{\infty}^{\varepsilon}(i) = \frac{f(x(i)) + \varepsilon}{\sum_{k=1}^{N} \bigl(f(x(k)) + \varepsilon\bigr)} \qquad (P\text{-almost surely}).$$

In particular, we have that

$$\arg\max_{i = 1, \ldots, N} f(x(i)) = \arg\max_{i = 1, \ldots, N} p_{\infty}^{\varepsilon}(i).$$

Some remarks are in order here. 

Remark 48 This theorem can be directly applied in the case where the projection 
operator (4.43) and a regularizing factor à are introduced in the optimization 


algorithm described in Section 4.4.3. 


Remark 49 The decreasing time step yn can be chosen as follows: 


with some $a$ and $b > 0$, $a < 1 + b$.


Now we shall look at some simulations to examine the behavior of the optimization 
algorithm presented and studied in the previous subsections. 


4.4.5. Numerical Example 


The objective of this subsection is to give a glimpse of the power of the optim- 
ization approach presented in the previous subsections. We shall consider both 
minimization and maximization of multimodal functions. 


Multimodal Function Minimization 
The learning automaton with continuous input (S-model environment) described in 
the preceding subsection was used to optimize the following multimodal function: 


m. 
RET 


[esso + sin(=*) + 5 sin() + 2s] + Um. (4.55) 
In practice, it is impossible to obtain perfect measurements. In order to get more
realistic simulations we have introduced a disturbance (observation noise) $w_n$,
where $w_n$ is uniformly distributed noise, $w_n \in [-\nu, \nu]$, $\nu = 0.1$. The realizations
(4.55) take values in the unit interval [0, 1]. More precisely, we have $f(x) > \nu$.


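From Corollary 1 (without $\delta$-regularization), we have that

$$\lim_{n \to \infty} p_n^{\varepsilon}(i) = p_{\infty}^{\varepsilon}(i) = \frac{V(i) + \varepsilon}{\sum_{k=1}^{N} \bigl(V(k) + \varepsilon\bigr)} \qquad (P\text{-almost surely})$$

with

$$V(i) = \int \left( \frac{H}{f(x(i)) + w} \right)^{\beta} \frac{1}{2\nu}\, \mathbf{1}_{[-\nu,\nu]}(w)\, dw = \frac{1}{2\nu} \cdot \frac{2\nu H^{2}}{\bigl[f(x(i)) + \nu\bigr]\bigl[f(x(i)) - \nu\bigr]} \qquad (\beta = 2),$$

where $\mathbf{1}_{[-\nu,\nu]}$ stands for the indicator function of the interval $[-\nu, \nu]$:

$$\mathbf{1}_{[-\nu,\nu]}(w) = \begin{cases} 1 & \text{if } w \in [-\nu, \nu], \\ 0 & \text{otherwise.} \end{cases}$$

The design parameters associated with the McMurtry-Fu function (see Corollary 1)
were selected as follows: $H = 0.01$ and $\beta = 2$.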
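
For the value $\beta = 2$ used here, the expression for $V(i)$ can be checked by direct integration. The short derivation below is a verification sketch that assumes the uniform noise density $1/(2\nu)$ on $[-\nu, \nu]$ and uses $f(x(i)) > \nu$, so that the integrand has no singularity:

```latex
\begin{aligned}
V(i) &= \frac{1}{2\nu}\int_{-\nu}^{\nu} \frac{H^{2}}{\bigl(f(x(i)) + w\bigr)^{2}}\,dw
      = \frac{H^{2}}{2\nu}\left[-\frac{1}{f(x(i)) + w}\right]_{w=-\nu}^{w=\nu} \\
     &= \frac{H^{2}}{2\nu}\left(\frac{1}{f(x(i)) - \nu} - \frac{1}{f(x(i)) + \nu}\right)
      = \frac{H^{2}}{\bigl(f(x(i)) + \nu\bigr)\bigl(f(x(i)) - \nu\bigr)}.
\end{aligned}
```

Since this quantity decreases as $f(x(i))$ grows, the action with the smallest $f(x(i))$ receives the largest limiting probability.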


At each iteration, the optimization algorithm based on the learning automaton 
performs the following four steps: 


Step 1: Choice of an action $u(i)$ (i.e., $x(i)$) on the basis of the probability distribution $p_n$.
The technique used by the algorithm to select one action $u(i)$ among $N$ actions is
based on the generation of a uniformly distributed random variable (any
machine routine, e.g., RANDU, can be used to generate a uniformly distributed
random variable).


Step 2: Observe the environment response and compute (4.39), (4.37) and (4.38). 


Step 3: Use this response to adjust the probability distribution according to the 
adopted reinforcement (learning, adaptive) scheme, equation (4.41). 


Step 4: Return to Step 1. 
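
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the averaging update is taken in the component form implied by (4.45) (only the component of the selected action moves toward the cost realization), the cost is the McMurtry-Fu-type function $(H/y)^{\beta}$, the step is taken as $c_0/n$, and `f_demo` is a hypothetical test function chosen so that $f(x) > \nu$. The names `run_automaton`, `f_demo` and `eps` are illustrative.

```python
import numpy as np

def f_demo(x):
    # Hypothetical test function (an assumption); its values stay above nu = 0.1.
    return 0.5 + 0.3 * np.cos(x)

def run_automaton(f, x, n_iter=3000, H=0.01, beta=2.0, c0=0.5,
                  eps=1e-6, nu=0.1, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    Z = np.zeros(N)                  # averaged costs, one per action
    p = np.full(N, 1.0 / N)          # uniform initial distribution
    for n in range(1, n_iter + 1):
        i = rng.choice(N, p=p)                      # Step 1: draw an action from p
        y = f(x[i]) + rng.uniform(-nu, nu)          # Step 2: noisy observation
        cost = (H / y) ** beta                      #         cost realization
        gamma = c0 / n                              #         assumed step 1 - alpha_n
        Z[i] += gamma * (cost - Z[i])               #         averaging of the selected component
        p = (Z + eps) / np.sum(Z + eps)             # Step 3: regularized probabilities
    return p, Z                                     # Step 4 is the loop itself

x = np.linspace(1.5, 19.5, 16)       # 16 equidistant actions, as in the text
p, Z = run_automaton(f_demo, x)
```

Because every cost realization is positive and each update is a convex combination, the components of `Z` stay nonnegative and `p` remains a valid probability vector without projection in this particular setting.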


The correction factor was selected as follows:

$$1 - \alpha_n = \frac{c_0}{n}, \tag{4.56}$$

with $c_0 = 0.5$ and where $\alpha_n$ corresponds to the weighting parameter (forgetting
factor) $\alpha$ in (4.37).
The number of actions was chosen to be equal to $N = 16$. We remark that
it is up to the designer to specify the number of actions $N$ of the automaton.
This specification can be done according to Lemma 5, Chapter 3. The interval
$[x_{\min}, x_{\max}]$ was partitioned uniformly into 16 values. The actions $x(i)$ were placed
at equidistant intervals at $x(1) = 1.5$, $x(2) = 2.7, \ldots, x(N) = 19.5$. The optimal
argument was $x^{*} = 9.9$. No prior information was utilized in the initialization of
the probabilities, i.e.,

$$p_0 = \left( \frac{1}{N}, \ldots, \frac{1}{N} \right)^{T}.$$

The algorithm was started with a purely uniform distribution for each action.


Figure 4.1 shows a simulation for $n = 1, 2, \ldots, 3000$. The upper-right corner shows
the noiseless function and a few samples at $x(1), \ldots, x(N)$. The evolution of the
costs $Z_n$ is shown on the bottom left. The probabilities $p_n$ and the time-varying $\alpha_n$ are
shown in the bottom-right plot. The final values for the probabilities, at $n = 3000$,
are shown on the top left.


After less than 1000 iterations all the components of the probability vector
converged practically to their final values. The component $p(8)$ continues to
increase slowly. It can be noted that the probability distribution plot can be regarded
as a “negative” of the function to be optimized.

[Figure 4.1 panels: probabilities versus i; multimodal function versus x(i); evolution of zbar versus n; evolution of probabilities versus n]

Figure 4.1 A multimodal function and the behavior of the self-learning
optimization algorithm


Starting from the 1500th iteration, we notice that the probability $p(7)$ changes very
slowly compared to the probability $p(8)$; the other probabilities tend quickly to
their final values. As expected, the lower-right plot shows that the parameter
$\alpha_n$ (4.56) increases to 1. The final values of the probabilities,
$p(7) = 0.2823$ and $p(8) = 0.3168$, were very close, with corresponding arguments
$x(7) = 8.7$ and $x(8) = 9.9$. Nevertheless, the minimum of the function $f(x)$
is close to these values. This behavior was expected from the theoretical results
stated in the previous subsection.


These simulations are indicative of results that can be obtained in real situations, 
and demonstrate the efficacy of the optimization algorithm described above. These 
kinds of results are important for engineers involved, for example, in the optimal 
design of plant structures (synthesis of distillation sequences, etc.). They are mainly 
interested in the determination of the optimal structure, but also in solutions close 
to the optimal one. In other words, they need some confidence level because the 
models as well as the cost evaluations they use exhibit uncertainties. 


We are now going to consider the maximization of a multimodal function. 


Multimodal Function Maximization 
The next simulations concern the maximization of (4.55), and the following 
unscaled function 


: l. x 
Yn = COS(Xn) + sin($) + P] sin v + Wn, 


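where $w_n$ was selected as Gaussian noise. The correction factor was selected as
follows:

$$1 - \alpha_n = \frac{c_0}{\sum_{t=1}^{n} \chi\bigl(u_t = u(i)\bigr)}, \tag{4.57}$$

where

$$\chi\bigl(u_n = u(i)\bigr) = \begin{cases} 1 & \text{if } u_n = u(i), \\ 0 & \text{if } u_n \neq u(i). \end{cases}$$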

The outline of computations at each iteration was as follows:

• select action $u_n = u(i)$;
• get evaluation (realization) $y_n$;
• compute $\alpha_n$;
• set $z_n(i) = y_n$;
• update $Z_{n+1}(i) = \alpha_n Z_n(i) + (1 - \alpha_n)\, z_n(i)$ and $Z_{n+1}(j)$'s, $j \neq i$;
• compute $p_n(j) = \bigl(Z_{n+1}(j) + \varepsilon\bigr) / \sum_{k=1}^{N} \bigl[Z_{n+1}(k) + \varepsilon\bigr]$ for all $j$;
• project $p_n(j)$ onto the simplex such that $p_n(j) \ge \varepsilon$ for all $j$, and $\sum_{j=1}^{N} p_n(j) = 1$.
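
The averaging step in this outline can be illustrated in isolation. The sketch below assumes that the step (4.57) is $c_0$ divided by the number of times the selected action has been chosen so far; the helper name `update_average` and the `counts` array are illustrative. With $c_0 = 1$ the first visit uses a step equal to 1, so the initial value of $Z(i)$ is forgotten and $Z(i)$ becomes the running sample mean of the realizations observed for action $i$.

```python
def update_average(Z, counts, i, y, c0=1.0):
    """One averaging step for the selected action i (per-action step size)."""
    counts[i] += 1
    gamma = c0 / counts[i]                     # assumed form of 1 - alpha_n in (4.57)
    Z[i] = (1.0 - gamma) * Z[i] + gamma * y    # Z(j), j != i, is left unchanged
    return Z

Z = [1.0, 1.0]          # components initialized to 1, as in the simulations
counts = [0, 0]
for y in (2.0, 4.0, 6.0):
    update_average(Z, counts, 0, y)
```

After these three updates `Z[0]` equals the mean of the three realizations, while `Z[1]` keeps its initial value.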


As for the previous simulations, we used an automaton with N actions (N = 16). 
The following choices were made: e = 10~ and co = 1; the elements of Z, were 
initialized as 1. 


In practice, it is impossible to obtain perfect measurements. We therefore introduced
a Gaussian noise $w_n$. Indeed, randomness must be incorporated into every simulation
that is to be an adequate mirror of reality [25]. The simulation results for
(4.55) are depicted in Figure 4.2. The final probabilities are shown in the top-left
corner. The values obtained by the maximization algorithm are represented by
crosses while the values given by Corollary 2 are represented by circles. We observe
that the fifth action, which corresponds to the optimal value $f(x^{*}) = 1.4914$, was
selected. Notice that the forgetting factor $\alpha_n$, given by (4.57), tends to 1. At the
bottom of Figure 4.2, the evolution of the probabilities and $Z_n(i)$ ($i = 1, \ldots, N$) is
depicted.
Figure 4.2 Final values of the probabilities after 5000 iterations, disturbed multimodal
function, the probabilities versus time, and the evolution of the averaged realization of the
function to be maximized. These simulations correspond to a multimodal function
corrupted by a noise uniformly distributed on [−0.1, 0.1]
In the next simulations we considered the unscaled multimodal function corrupted
by a Gaussian noise $w_n \sim N(0, 0.1)$. As before, the optimal value of the argument
is $x^{*} = 6.3$, $f(x^{*}) = 1.4914$. The correction factor was selected as in the first
simulations dealing with the minimization of a multimodal function (4.56). The
function to be optimized was not positive. As a consequence, some of the components
of the probability vector would be negative unless the projection procedure
(4.43) was implemented.


The behavior of the optimization algorithm is illustrated in Figure 4.3. The final
probabilities are depicted in the top-left corner. The values obtained by the optimization
algorithm are represented by crosses, the values given by Corollary 2 are
shown by circles. The circles and the crosses do not coincide because the projection
procedure was used. From the simulations, it is clear that the fifth action, which
corresponds to the optimal argument ($x^{*} = 6.3$), was selected. The evolutions
of $Z_n(i)$ ($i = 1, \ldots, N$) and the probabilities are shown at the bottom of Figure
4.3. They converged to their final values after approximately 1100 iterations. The
forgetting factor $\alpha_n$ tended close to 1 in less than 200 steps.
Figure 4.3 Final values of the probabilities after 5000 iterations, disturbed multimodal
function, the probabilities versus time and the evolution of the averaged realization of the
function to be maximized. The multimodal function is corrupted with Gaussian noise
This subsection was concerned with an improved version of the reinforcement
scheme algorithm originally developed by McMurtry and Fu, used as a multimodal
searching technique for solving stochastic optimization problems. In summary, to
verify the results obtained by theoretical analysis, a number of simulations were
carried out on the optimization of a multimodal function. Notice that a suitable
choice of the number of actions $N$ is an important issue and depends on the desired
accuracy. The optimization algorithms implemented here gave satisfactory
results, and the results obtained by theoretical analysis were verified. These results
can be explained by the adaptive structure of the optimization algorithm and the
random selection of the actions. Finally, we point out that the self-optimization
algorithm needed few programming steps and little storage capacity.


We considered only the single-dimensional case. In order to preserve the probability
measure, a projection procedure was introduced. We also introduced a regularizing
parameter $\varepsilon$ in order to ensure the uniqueness of the solution. A vector version, more
appropriate for convergence analysis, was derived. A decreasing sequence was also
considered for the correction factor. An auxiliary strictly positive parameter $\delta > 0$
was introduced, which prevents degenerate situations where the realization of the
function to be optimized is equal to zero.


We end this part with some conclusions. 


4.4.6. Conclusions 


The basic problem in the design of a learning automaton is the development of 
a suitable updating reinforcement scheme for a specific behavior of the auto- 
maton. The asymptotic properties of the improved McMurtry-Fu scheme were 
stated using the traditional convergence results on the Robbins-Monro type of 
stochastic algorithms. An illustrative computer simulation example was provided 
to demonstrate the validity of the suggested optimization algorithm. The theory of 
learning systems assures that no (or little) a priori information is needed about the 
random environment (the function to be optimized). Nevertheless, any prior
information that is available can be introduced through the initial probability distribution,
or through the design of the environment response (cost function and normalization,
etc.). Finally, we conclude that the results obtained seem very attractive.


4.5. Case 2: Team of Binary Learning Automata 


Many real optimization problems demand a collective (group) solution where 
each participant expresses its attitude to a current environment situation using 
only a binary response ("yes" or "no"). In some sense, the behavior of the participant
should be synchronized to obtain a common optimization aim. Frequently,
the information necessary for solving a problem (control, optimization, etc.) is 
not available or may be incomplete. It is then necessary to learn (acquire) addi- 
tional information. Learning deals with the ability of systems to improve their 
response (performance in the sense of some criterion) based on past experience
[109]. Learning models stem from diverse approaches, frequently grounded on 
heuristic intuitions and experiments. In the case of finite (discrete) solution sets, 
the more adequate operation frame is the learning automata paradigm [68]. 


Learning automata have been used to solve engineering problems as well as prob- 
lems stemming from economics, characterized by nonlinearity and a high level of 
uncertainty [62, 66, 68, 91, 109]. They have been used for process modeling and 
control, optimization, pattern recognition, image processing, signal processing, 
trajectory planning of robot manipulators, telephone and internet traffic rout- 
ing, process navigation, neuro-fuzzy network training, process synthesis, etc. 
[14, 18, 33, 40, 43, 54, 65, 70, 83, 105, 106, 110, 113]. 


This section presents another recursive stochastic algorithm and its analysis. We 
introduce the algorithm that has been developed for optimization purposes and we 
state its asymptotic properties on the basis of the Lyapunov approach and martingale
theory. The Bush-Mosteller scheme is considered. A detailed derivation of the analysis
for the case of a single automaton using the Bush-Mosteller scheme is available
in [75]. In what follows, we consider an optimization algorithm based on a team
of cooperative learning automata, the analysis of which follows the same lines (see footnote 10).


Binary coding is common with genetic algorithms. Howell proposed a genetic 
learning automata (GLA) optimization algorithm [41], applied to probabilities of 
binary actions in a team of learning automata. In the GLA algorithm, the popula- 
tion consisted of strings of binary-action learning automata probabilities. At each 
generation, a set of sample vectors was generated using the probability distribution 
in the population. The sample vectors were evaluated, and a max—min normaliz- 
ation conducted within the current population. The probabilities in the strings of 
the population were then updated using a reinforcement scheme. Crossover and 
reordering operators were applied before proceeding to next generation. 


In this section, an optimization algorithm is developed based on a team of learning 
automata with two actions (one action is equal to 0 and the second one is equal to 1). 
The outputs of the team of learning automata form a binary number which con- 
stitutes the environment input. The continuous environment response is obtained 
from a realization of the function to be minimized, and a normalization proced- 
ure is then used to ensure the preservation of the probability measure. Notice that 
the environment response is not binary but belongs to the unit segment (S-model 
environment). Each automaton in the team is provided with the same normalized 
environment response. The Bush-Mosteller reinforcement scheme with a continu- 
ous input (continuous environment response) and a time-varying correction factor 
is used for updating the probabilities. 


10 This section is based on K. Najim, A. S. Poznyak and E. Ikonen, Optimization based on 
a team of automata with binary outputs, Automatica, vol. 40, (in press). Reproduced with 
permission from Elsevier. 
The main contribution of this section considers the theoretical results on the asymptotic
properties of the learning system. The analysis of the behavior of a team of
automata with two actions has not been carried out before. A new type of normal- 
ization procedure is also constructed. The tools (martingale theory and Lyapunov 
approach) used in the derivations that follow are similar to those used in previous 
studies carried out by the authors. The main obstacle to using the Lyapunov theory 
is in finding a suitable Lyapunov function. For our algorithm, a Lyapunov function 
is at our disposal. The main theorem used for the analysis is the Robbins-Siegmund 
Theorem. 


The next subsection deals with the definition of a team of learning automata and
the presentation of the reinforcement scheme (adaptation mechanism) used in the
adaptation procedure.


4.5.1. Stochastic Learning Automata 
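A $k$-automaton ($k = \overline{1, N}$; see footnote 11) with binary output, belonging to a team with $N$ participants
and operating in a random environment (medium), is an adaptive discrete
machine described by

$$\bigl\{ \Xi,\; U^{k},\; \{\xi_n^{k}\},\; \{u_n^{k}\},\; \{p_n^{k}\},\; T^{k} \bigr\}$$

where:

1. $\Xi$ is the automaton input bounded set;

2. $U^{k}$ denotes the set $\{u^{k}(1) = 1,\; u^{k}(2) = 0\}$ of actions of the automata ($k = \overline{1, N}$)
(we consider a team of $N$ automata), and $\{u_n^{k}\}$ is a sequence of binary
automaton outputs (actions): $u_n^{k} \in \{0; 1\}$;

3. $\{\xi_n^{k}\}$ is a sequence of automaton inputs (payoffs $\xi_n^{k} \in \Xi$) provided by the given
mechanism in a continuous (S-model environment) form;

4. $p_n^{k} = \bigl[p_n^{k}(1),\; p_n^{k}(2)\bigr]^{T}$ is the conditional probability distribution at time $n$:

$$p_n^{k}(i) = \Pr\bigl\{ \omega \in \Omega : u_n^{k} = u^{k}(i) \mid \mathcal{F}_{n-1} \bigr\}, \qquad \sum_{i=1}^{2} p_n^{k}(i) = 1,$$

where $\mathcal{F}_n = \sigma\bigl(u_1^{k}, p_1^{k}, \xi_1^{k};\; \ldots;\; u_n^{k}, p_n^{k}, \xi_n^{k}\bigr)$ is the minimal $\sigma$-algebra generated
by the corresponding events ($\mathcal{F}_n \subset \mathcal{F}$);

5. $T^{k}$ represents the reinforcement scheme (updating scheme) that changes
the probability vector $p_n^{k}$ to $p_{n+1}^{k}$, that is,

$$p_{n+1}^{k} = p_n^{k} + \gamma_n^{k}\, T_n^{k}\bigl( p_n^{k};\; \{\xi_t^{k}\}_{t=1,\ldots,n},\; \{u_t^{k}\}_{t=1,\ldots,n} \bigr), \tag{4.58}$$

$p_1^{k}(i) > 0$, $i = 1, 2$, where $\gamma_n^{k}$ is a scalar correction factor and the vector
$T_n^{k}(\cdot) = \bigl[T_n^{k,1}(\cdot),\; T_n^{k,2}(\cdot)\bigr]^{T}$ satisfies the following conditions (for preserving
the probability measure):

$$\sum_{i=1}^{2} T_n^{k,i}(\cdot) = 0,$$

$$p_n^{k}(i) + \gamma_n^{k}\, T_n^{k,i}(\cdot) \in [0, 1] \qquad \forall n,\; k = 1, \ldots, N.$$

11 The notation $k = \overline{1, N}$ means $k = 1, 2, \ldots, N$.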

The environment establishes the relation between the actions of the automaton and 
the signals received at its input. It includes all external influences. The environment 
produces a random response whose statistics depend on the current stimulus or 
input. 


4.5.2. Optimization Algorithm 


Several engineering problems require a multimodal functions optimization strategy. 
Usually, the function f (x) to be optimized is not explicitly known: only samples of 
the disturbed values of f (x) at various settings of x can be observed, complicating 
the application of the usual numerical optimization procedures. 


Let us consider a real-valued scalar function $f(x)$, $x \in [x_{\min}, x_{\max}]$. We would like
to find the value $x = x^{*}$ that minimizes this function, i.e.,

$$x^{*} = \arg\min_{x \in X = [x_{\min},\, x_{\max}]} f(x). \tag{4.59}$$

There are almost no conditions concerning the function $f(x)$ (continuity, unimodality,
differentiability, convexity, etc.) to be optimized. We are concerned with an
$\varepsilon$-global optimization problem of multimodal and nondifferentiable functions.


The actions of the team of $N$ stochastic automata form a binary string of length $N$:

$$u_n^{1}, u_n^{2}, \ldots, u_n^{N},$$

where $u_n^{k} \in \{0; 1\}$. The quantized real value is given by

$$x_n = x(u_n) := x_{\min} + \Delta \sum_{i=1}^{N} u_n^{i}\, 2^{i-1}, \tag{4.60}$$

where $x_n \in X$, $X = \{x_{\min},\; x_{\min} + \Delta,\; x_{\min} + 2\Delta,\; \ldots,\; x_{\max}\}$ and
$u_n := (u_n^{1}, u_n^{2}, \ldots, u_n^{N})^{T}$. The resolution of the quantization is equal to

$$\Delta = \frac{x_{\max} - x_{\min}}{2^{N} - 1}. \tag{4.61}$$
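
The quantization can be illustrated with a short sketch. It assumes that the weight of the $i$th automaton is $2^{i-1}$, which is the choice that makes the all-zero string map to $x_{\min}$ and the all-one string map to $x_{\max}$ with the resolution $\Delta$ of (4.61); the function name `decode` is illustrative.

```python
def decode(bits, x_min, x_max):
    """Map the binary actions (u^1, ..., u^N) of the team to a point of the grid X."""
    N = len(bits)
    delta = (x_max - x_min) / (2 ** N - 1)            # resolution of the quantization
    k = sum(b << i for i, b in enumerate(bits))       # sum of u^i * 2^(i-1), i = 1..N
    return x_min + delta * k

lowest = decode([0, 0, 0, 0], 0.0, 15.0)
highest = decode([1, 1, 1, 1], 0.0, 15.0)
first_bit = decode([1, 0, 0, 0], 0.0, 15.0)
last_bit = decode([0, 0, 0, 1], 0.0, 15.0)
```

With $N = 4$, $x_{\min} = 0$ and $x_{\max} = 15$ the resolution is 1, so the first automaton moves the argument by one grid step and the last automaton by eight.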


Without loss of generality, we can assume that $x_{\min} = 0$. Let $y_n$ be the observation
of the function $f(x)$ at the point $x_n \in X$, i.e.,

$$y_n = f(x_n) + w_n, \tag{4.62}$$

where $w_n$ is the observation noise (disturbance) at time $n$.

We assume that the observation noise is a sequence of independent random variables
with conditional mathematical expectations equal to zero, and finite variances, i.e.,

H1: The conditional mathematical expectations of the observation noise $w_n$ are
equal to zero for any time $n = 1, 2, \ldots$: $E\{w_n \mid \mathcal{F}_{n-1}\} = 0$, $\mathcal{F}_{n-1} := \sigma(x_s, w_s;\; s = \overline{1, n-1})$,
i.e., $\{w_n\}$ is a sequence of martingale differences.

H2: The conditional variances of the observation noises exist and are uniformly
bounded: $E\{w_n^{2} \mid \mathcal{F}_{n-1}\} = \sigma_n^{2}(i)$, $\max_i \sup_n \sigma_n^{2}(i) := \sigma^{2} < \infty$.


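The optimization algorithm operates as follows (see Figure 4.4). At each time $n$,
each automaton of the team selects randomly an action $u_n^{k}$ ($k = \overline{1, N}$). According
to (4.60) and (4.61), these actions are in turn used to calculate the new value of
the argument $x_n$, and then the realization $y_n$ of the function $f(x_n)$ is obtained. This
realization is then normalized as follows:

$$\hat{\xi}_n = \frac{\bigl[ s_n(i_1, \ldots, i_N) - \min_{j_1, \ldots, j_N} s_{n-1}(j_1, \ldots, j_N) \bigr]_{+}}{\max_{i_1, \ldots, i_N} \bigl[ s_n(i_1, \ldots, i_N) - \min_{j_1, \ldots, j_N} s_{n-1}(j_1, \ldots, j_N) \bigr]_{+} + 1}, \tag{4.63}$$

where

$$s_n(i_1, \ldots, i_N) = \frac{\sum_{t=1}^{n} y_t \prod_{k=1}^{N} \chi\bigl(u_t^{k} = u^{k}(i_k)\bigr)}{\sum_{t=1}^{n} \prod_{k=1}^{N} \chi\bigl(u_t^{k} = u^{k}(i_k)\bigr)} \tag{4.64}$$

with

$$[x]_{+} := \begin{cases} x & \text{if } x \ge 0, \\ 0 & \text{if } x < 0, \end{cases} \qquad \chi\bigl(u_n^{k} = u^{k}(i)\bigr) := \begin{cases} 1 & \text{if } u_n^{k} = u^{k}(i), \\ 0 & \text{otherwise.} \end{cases} \tag{4.65}$$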
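
The normalization described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it reads (4.64) as the sample mean of the observations obtained for each action collection, reads (4.63) as the positive part of the difference to the previous minimum divided by the largest such positive part plus one, and restricts the minimum and maximum to collections that have already been selected (an assumption, since the mean is undefined for a collection that was never selected). The class name `Normalizer` is illustrative.

```python
def positive_part(x):
    return x if x >= 0.0 else 0.0

class Normalizer:
    """Keeps the sample means s_n per action collection and returns the normalized response."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def means(self):
        return {a: self.sums[a] / self.counts[a] for a in self.sums}

    def respond(self, actions, y):
        prev = self.means()                          # s_{n-1} for the visited collections
        a = tuple(actions)
        self.sums[a] = self.sums.get(a, 0.0) + y     # update the running mean of the
        self.counts[a] = self.counts.get(a, 0) + 1   # selected collection only
        cur = self.means()                           # s_n
        floor = min(prev.values()) if prev else cur[a]
        num = positive_part(cur[a] - floor)
        den = max(positive_part(s - floor) for s in cur.values()) + 1.0
        return num / den                             # always in [0, 1)

nz = Normalizer()
r1 = nz.respond((0, 1), 2.0)
r2 = nz.respond((1, 1), 5.0)
r3 = nz.respond((0, 1), 0.0)
```

The added 1 in the denominator keeps the response strictly below 1, and a collection whose running mean is at or below the previous minimum receives the response 0.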
[Figure 4.4 diagram; recoverable label: observation (realization) of the function f(x)]

Figure 4.4 Schematic diagram of the optimization algorithm


The normalized environment response belongs to the unit segment, $\hat{\xi}_n \in [0, 1]$. It
is then used as the input of all the automata belonging to the team of automata,
that is,

$$\xi_n^{k} = \hat{\xi}_n$$

($\xi_n^{k}$ represents the input of the $k$th automaton). We are dealing with a random stationary
environment where responses $\hat{\xi}_n$ are characterized (in view of H1 and H2)
by the following two properties.


Lemma 13 Assume that assumptions H1 and H2 hold and suppose that the
considered reinforcement scheme (4.58) generates the sequences $\{p_n^{k}\}$ such that
for any index collection $(i_1, \ldots, i_N)$ the following "ergodic condition" is fulfilled:

$$\sum_{n=1}^{\infty} \prod_{k=1}^{N} p_n^{k}(i_k) \stackrel{\text{a.s.}}{=} \infty. \tag{4.66}$$

Then the normalized environment response $\hat{\xi}_n$ (4.63) possesses the following
properties:

• The number of selections of each action collection $(i_1, \ldots, i_N)$ is infinite, i.e.,

$$\sum_{t=1}^{\infty} \prod_{k=1}^{N} \chi\bigl(u_t^{k} = u^{k}(i_k)\bigr) \stackrel{\text{a.s.}}{=} \infty. \tag{4.67}$$

• The random variable $s_n(i_1, \ldots, i_N)$ (4.64) is asymptotically equal to the
value of the function to be optimized for the corresponding point $x(u^{1}(i_1), u^{2}(i_2), \ldots, u^{N}(i_N))$
belonging to the finite set, i.e.,

$$s_n(i_1, \ldots, i_N) = f\bigl(x(u^{1}(i_1), u^{2}(i_2), \ldots, u^{N}(i_N))\bigr) + o_{\omega}(1), \tag{4.68}$$

where $o_{\omega}(1)$ denotes any random sequence tending to zero with probability 1.

• For the selected actions $u_n^{k} = u^{k}(i_k)$ at time $n$, the normalized environment
reaction $\hat{\xi}_n$ is asymptotically equal to $A(i_1, \ldots, i_N)$, i.e.,

$$\hat{\xi}_n \stackrel{\text{a.s.}}{=} A(i_1, \ldots, i_N) + o_{\omega}(1) \in [0, 1], \tag{4.69}$$

where

$$A(i_1, \ldots, i_N) = \frac{f\bigl(x(i_1, \ldots, i_N)\bigr) - f\bigl(x(\alpha_1, \ldots, \alpha_N)\bigr)}{\max_{i_1, \ldots, i_N} \bigl[ f\bigl(x(i_1, \ldots, i_N)\bigr) - f\bigl(x(\alpha_1, \ldots, \alpha_N)\bigr) \bigr] + 1} \tag{4.70}$$

with $A(i_1, \ldots, i_N) \in [0, 1)$ and $(\alpha_1, \ldots, \alpha_N) := \arg\min_{(i_1, \ldots, i_N)} f\bigl(x(i_1, \ldots, i_N)\bigr)$.

• For the optimal action $u_n = u(x(\alpha_1, \ldots, \alpha_N))$, the normalized environment
reaction is asymptotically equal to 0, i.e.,

$$\hat{\xi}_n = o_{\omega}(1) \tag{4.71}$$

if $u_n = u(x(\alpha_1, \ldots, \alpha_N))$.
The proof of this lemma is based on the Borel-Cantelli Lemma, the strong law 


of large numbers and the Robbins-Siegmund Theorem (see [93], Theorem 2 in 
Appendix A). 


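Proof

• Equation (4.67) follows directly from the assumption (4.66) and the Borel-Cantelli
Lemma.

• Let us introduce the following sequence

$$\theta_n(i_1, \ldots, i_N) := s_n(i_1, \ldots, i_N) - f\bigl(x(i_1, \ldots, i_N)\bigr).$$

Using (4.64) and (4.62) gives

$$\theta_n(i_1, \ldots, i_N) = \frac{\sum_{t=1}^{n} w_t \prod_{k=1}^{N} \chi\bigl(u_t^{k} = u^{k}(i_k)\bigr)}{\sum_{t=1}^{n} \prod_{k=1}^{N} \chi\bigl(u_t^{k} = u^{k}(i_k)\bigr)},$$

which leads to the following recurrent form for $\theta_n(i_1, \ldots, i_N)$:

$$\theta_n(i_1, \ldots, i_N) = \bigl[1 - \lambda_n(i_1, \ldots, i_N)\bigr]\, \theta_{n-1}(i_1, \ldots, i_N) + \lambda_n(i_1, \ldots, i_N)\, w_n,$$

where

$$\lambda_n(i_1, \ldots, i_N) = \frac{\prod_{k=1}^{N} \chi\bigl(u_n^{k} = u^{k}(i_k)\bigr)}{\sum_{t=1}^{n} \prod_{k=1}^{N} \chi\bigl(u_t^{k} = u^{k}(i_k)\bigr)}.$$

Taking into account the assumptions H1 and H2, we derive

$$E\bigl\{ \theta_n^{2}(i_1, \ldots, i_N) \mid \mathcal{F}_{n-1},\; u_n = u(i_1, \ldots, i_N) \bigr\} = \bigl[1 - \lambda_n(i_1, \ldots, i_N)\bigr]^{2} \theta_{n-1}^{2}(i_1, \ldots, i_N) + \lambda_n^{2}(i_1, \ldots, i_N)\, \sigma_n^{2}$$

$$\le \bigl[1 - \lambda_n(i_1, \ldots, i_N)\bigr]\, \theta_{n-1}^{2}(i_1, \ldots, i_N) + \lambda_n^{2}(i_1, \ldots, i_N)\, \sigma^{2}. \tag{4.72}$$

It is easy to see that (4.67) implies

$$\sum_{t=1}^{\infty} \lambda_t(i_1, \ldots, i_N) \stackrel{\text{a.s.}}{=} \infty, \qquad \sum_{t=1}^{\infty} \lambda_t^{2}(i_1, \ldots, i_N) \stackrel{\text{a.s.}}{<} \infty.$$

In view of these facts and the Robbins-Siegmund Theorem, it follows that
$\theta_n(i_1, \ldots, i_N) \to 0$ almost surely as $n \to \infty$, and hence

$$\lim_{n \to \infty} s_n(i_1, \ldots, i_N) = f\bigl(x(i_1, \ldots, i_N)\bigr).$$

So, (4.68) is proven.

• (4.69) follows directly from (4.68) and (4.66).

• (4.71) is the consequence of (4.69) and (4.70).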
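Lemma 14 If for some reinforcement scheme the following inequality holds:

$$\frac{1}{n} \sum_{t=1}^{n} \prod_{k=1}^{N} p_t^{k}(i_k) \ge O\!\left( \frac{1}{n^{\tau}} \right), \qquad \tau \in \left(0, \tfrac{1}{2}\right), \tag{4.73}$$

then for any small positive $\varepsilon$ this implies

$$s_n(i_1, \ldots, i_N) - f\bigl(x(i_1, \ldots, i_N)\bigr) \stackrel{\text{a.s.}}{=} o_{\omega}\!\left( \frac{1}{n^{1/2 - \tau - \varepsilon}} \right) \tag{4.74}$$

and, as a result, for large enough $n \ge n_0(\omega)$, it follows that

$$\hat{\xi}_n \stackrel{\text{a.s.}}{=} A(i_1, \ldots, i_N) + o_{\omega}\!\left( \frac{1}{n^{1/2 - \tau - \varepsilon}} \right) \in [0, 1], \tag{4.75}$$

$$E\bigl\{ \bigl(\hat{\xi}_n - A(i_1, \ldots, i_N)\bigr)^{2} \mid u_n^{k} = u^{k}(i_k),\; k = \overline{1, N} \bigr\} = o\!\left( \frac{1}{n^{1 - 2\tau}} \right), \tag{4.76}$$

and

$$E\bigl\{ \hat{\xi}_n \mid u_n^{k} = u^{k}(i_k),\; k = \overline{1, N} \bigr\} = A(i_1, \ldots, i_N) + o\!\left( \frac{1}{n^{1/2 - \tau}} \right). \tag{4.77}$$

The proof is based on Lemmas 4 and 5 in [93], Appendix A.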


Proof Notice that according to the strong law of large numbers [92] and 
Lemma 4 (Appendix A in [93]), it follows that 


p» ira x(uf = uf (ig) — [ote Sona 0/270, (4.78) 
k=1 k=l 


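Let us search for the lower and upper bounds of $\lambda_n(i_1,\ldots,i_N)$:

$$\lambda_n(i_1,\ldots,i_N) = \frac{\prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)}{\sum_{t=1}^{n}\prod_{k=1}^{N}\chi\left(u_t^k = u^k(i_k)\right)}. \tag{4.79}$$

Substituting (4.78) into (4.79) gives

$$\lambda_n(i_1,\ldots,i_N) = \frac{\prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)}{n\left[\frac{1}{n}\sum_{t=1}^{n}\prod_{k=1}^{N} p_t^k(i_k) + o_\omega\left(n^{-(1/2-\delta)}\right)\right]},$$

which implies for large enough $n \ge n_0(\omega)$:

$$\lambda_n(i_1,\ldots,i_N) \ge \frac{\prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)}{n\left[1 + o_\omega\left(n^{-(1/2-\delta)}\right)\right]} \ge \frac{\prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)}{n}\,[1 - \varepsilon].$$

Using (4.73) we have an expression for the upper bound:

$$\lambda_n(i_1,\ldots,i_N) \le \frac{\prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)}{n\left[O(1/n^{\tau}) + o_\omega\left(n^{-(1/2-\delta)}\right)\right]}$$

$$= \prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)[1 + o_\omega(1)]\,O\left(n^{\tau-1}\right)$$

$$\le \prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)[1 + \varepsilon]\,O\left(n^{\tau-1}\right).$$

From the last inequality, it also follows that

$$\lambda_n^2(i_1,\ldots,i_N) \le \prod_{k=1}^{N}\chi\left(u_n^k = u^k(i_k)\right)[1 + \varepsilon]\,O\left(n^{2\tau-2}\right).$$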


In view of Lemma 4 (Appendix A in [93]), from (4.72) we derive

$$s_n(i_1,\ldots,i_N) - f(x(i_1,\ldots,i_N)) \overset{a.s.}{=} o_\omega\left(\frac{1}{n^{1/2-\tau-\varepsilon}}\right),$$

from which (4.74) and (4.75) follow.


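Using (4.72) and Lemma 5 (Appendix A) in [93], we obtain (4.76), from which, according to the Jensen inequality and the following relations:

$$\left|E\left\{\hat{\xi}_n - \Delta(i_1,\ldots,i_N) \mid u_n^k = u^k(i_k),\ k = 1,\ldots,N\right\}\right| \le E\left\{\left|\hat{\xi}_n - \Delta(i_1,\ldots,i_N)\right| \mid u_n^k = u^k(i_k),\ k = 1,\ldots,N\right\}$$

$$\le \sqrt{E\left\{\left(\hat{\xi}_n - \Delta(i_1,\ldots,i_N)\right)^2 \mid u_n^k = u^k(i_k),\ k = 1,\ldots,N\right\}} = o\left(\frac{1}{n^{1/2-\tau}}\right)$$

we derive (4.77). The lemma is proven.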


Finally, the automaton input is used in connection with a modified version of the Bush-Mosteller reinforcement scheme [68] to adjust the probability distributions, i.e.,

$$p_{n+1}^k(1) = p_n^k(1) + \Delta p_n^k(1), \qquad p_{n+1}^k(2) = 1 - p_{n+1}^k(1), \tag{4.80}$$

$$\Delta p_n^k(1) = \gamma_n^k\left[u_n^k - p_n^k(1) + \xi_n\left(1 - 2u_n^k\right)\right],$$

where

$$\gamma_n^k \in [0,1], \qquad \xi_n \in [0,1].$$
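A single update of this scheme for one binary automaton can be sketched as follows. This is only an illustration under the reading of (4.80) in which $p(1)$ is the probability of the action $u = 1$ and $\xi_n$ is the normalized environment response (0 is the most favorable value); the function and argument names are hypothetical.

```python
def bush_mosteller_step(p1, u, xi, gamma):
    """One update of p(1) for a binary automaton.

    p1    : current probability of selecting the action u = 1
    u     : action actually selected at this step (0 or 1)
    xi    : normalized environment response in [0, 1] (0 = best, 1 = worst)
    gamma : correction factor in [0, 1]
    Returns the pair (p(1), p(2)) after the update.
    """
    delta = gamma * (u - p1 + xi * (1 - 2 * u))
    p1_next = p1 + delta
    return p1_next, 1.0 - p1_next

# Favorable response after choosing u = 1: p(1) moves toward 1.
rewarded = bush_mosteller_step(0.5, 1, 0.0, 0.1)
# Unfavorable response after choosing u = 1: p(1) moves toward 0.
penalized = bush_mosteller_step(0.5, 1, 1.0, 0.1)
```

Since the new value is the convex combination $(1-\gamma)\,p(1) + \gamma\,[u + \xi(1-2u)]$ with $u + \xi(1-2u) \in [0,1]$, the updated probabilities stay in $[0,1]$ without any projection.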


The original Bush-Mosteller reinforcement scheme uses a binary input (P-model
environment) and a constant correction factor $\gamma_n^k = \gamma = \text{const}$.
The loss function $\Phi_n$ associated with each learning automaton is given by
| an Sees Oe 
o=-) 8 2 5 (4.81) 
t=] t=1 


It is a useful quantity for judging the behavior of a learning automaton. We will 
show in the sequel that if a stochastic automaton minimizes its loss function, then 
it automatically solves the corresponding unconstrained stochastic optimization 
problem on a discrete set. The resolution of the discretization is given by (4.61). 
This resolution increases with the number of automata that constitute the team. 


The convergence as well as the convergence rate of this optimization algorithm will
be stated in the next subsection.


4.5.3. Asymptotic Properties 


In [74, 91] the analysis of an optimization algorithm based on the behavior of a 
learning automaton that uses the Bush-Mosteller [17] scheme with time-varying 
correction factor and normalization procedure has been considered. A statement of 
the asymptotic behavior of this optimization algorithm is set out in [74, 91]. 


In this study, we assume that 


H3: The correction factor $\gamma_n^k$ in (4.80) is time-varying and is selected for any $k$ according to the following rule:

$$\gamma_n^k = \frac{\gamma}{n+a}, \qquad \gamma \in (0,1),\ a > \gamma. \tag{4.82}$$

H4: The initial probabilities are assumed to be strictly positive, i.e.,

$$p_1^k(i) > 0 \qquad \forall k = 1,\ldots,N;\ i = 1,2.$$

The Bush-Mosteller scheme (4.80) can be rewritten in the following vector form:

$$p_{n+1}^k = p_n^k + \gamma_n^k\left[e(u_n^k) - p_n^k + \xi_n\left(e^2 - 2e(u_n^k)\right)\right], \tag{4.83}$$

with $\gamma_n^k = \gamma_n = \gamma/(n+a)$, $\gamma \in (0, 1/N)$, $a > N\gamma$, $\xi_n \in [0,1]$, $e(u_n^k) := (u_n^k, 1 - u_n^k)^T$, $u_n^k \in \{0; 1\}$ and $e^2 := (1, 1)^T$.


Theorem 8 For the Bush-Mosteller scheme (4.83), condition (4.66) is satisfied,
and if the assumptions (H1)-(H4) hold, and the optimal action is single, i.e.,

$$\min_{(i_1,\ldots,i_N) \ne (\alpha_1,\ldots,\alpha_N)} \Delta(i_1,\ldots,i_N) := \Delta^* > 0 \tag{4.84}$$

then the given collection of automata with binary actions selects asymptotically the
global optimal point $x(\alpha_1,\ldots,\alpha_N)$, and the loss function $\Phi_n$ (4.81) tends to its
minimal value equal to zero ($\min_{(i_1,\ldots,i_N)} \Delta(i_1,\ldots,i_N) = 0$) with probability 1 (at
almost all trajectories) more quickly than


The proof is obtained by selecting a Lyapunov function $W_n := \left(1 - \prod_{k=1}^{N} p_n^k(\alpha_k)\right) / \prod_{k=1}^{N} p_n^k(\alpha_k)$, finding an upper bound for $E\{W_{n+1} \mid \mathcal{F}_n\}$ and using
the Robbins-Siegmund Theorem, hence showing that $W_n \xrightarrow[n\to\infty]{a.s.} 0$, which implies
$\prod_{k=1}^{N} p_n^k(\alpha_k) \xrightarrow[n\to\infty]{a.s.} 1$ and $\Phi_n \xrightarrow[n\to\infty]{a.s.} 0$. A similar treatment of $W_n := (1 - p_n(\alpha)) / p_n(\alpha)$
in the case of a single automaton can be found in Section 3.6.1 in [91] (see
also [74]); here the more complex case of a team of learning automata (with
binary actions) is considered.


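Proof Let us consider the following estimation:

$$\prod_{k=1}^{N} p_{n+1}^k(i_k) = \prod_{k=1}^{N}\left(p_n^k + \gamma_n\left[e(u_n^k) - p_n^k + \xi_n\left(e^2 - 2e(u_n^k)\right)\right]\right)_{i_k}$$

$$= \prod_{k=1}^{N}\left(p_n^k(i_k)\,[1 - \gamma_n] + \gamma_n\left[e(u_n^k) + \xi_n\left(e^2 - 2e(u_n^k)\right)\right]_{i_k}\right)$$

$$\ge \prod_{k=1}^{N} p_n^k(i_k)\,[1 - \gamma_n]^N \ge \cdots \ge \prod_{k=1}^{N} p_1^k(i_k)\prod_{t=1}^{n}[1 - \gamma_t]^N$$

$$\ge \prod_{k=1}^{N} p_1^k(i_k)\prod_{t=1}^{n}[1 - N\gamma_t]. \tag{4.85}$$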


From Lemma A.3.3 [91], it follows that 


" (t 
]Ila-5»21*57*2 
a 


t=] 


Ny 
) a> Ny € (0,1) 
(4.86) 


Ny =1, 0. 
nca y xdi 
Substituting (4.86) into (4.85) leads to the desired result (4.66). Notice also that
from (4.86) and (4.85) it follows that in (4.74)

$$\tau = N\gamma \in (0,1)$$

and hence in (4.75), we have

$$\hat{\xi}_n - \Delta(i_1,\ldots,i_N) \overset{a.s.}{=} o_\omega\left(\frac{1}{n^{1/2-N\gamma-\varepsilon}}\right). \tag{4.87}$$
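The appearance of the exponent $N\gamma$ can be illustrated numerically: for $\gamma_t = \gamma/(t+a)$ the product $\prod_{t=1}^{n}(1 - N\gamma_t)$ decays like a constant times $n^{-N\gamma}$. The sketch below only checks this scaling; the values of `N`, `gamma` and `a` are arbitrary choices satisfying $\gamma \in (0, 1/N)$ and $a > N\gamma$, and are not taken from the text.

```python
def scaled_product(n, N=4, gamma=0.05, a=1.0):
    """Return n**(N*gamma) * prod_{t=1}^{n} (1 - N*gamma / (t + a)).

    If the product decays like C * n**(-N*gamma), this quantity
    settles to the constant C as n grows.
    """
    c = N * gamma
    prod = 1.0
    for t in range(1, n + 1):
        prod *= 1.0 - c / (t + a)
    return (n ** c) * prod

r_small = scaled_product(10_000)
r_large = scaled_product(20_000)
```

Doubling `n` leaves the scaled product essentially unchanged, which is the behavior that a polynomial lower bound on the probabilities relies on.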
Select the following Lyapunov function:

$$W_n := \frac{1 - \prod_{k=1}^{N} p_n^k(\alpha_k)}{\prod_{k=1}^{N} p_n^k(\alpha_k)} \tag{4.88}$$

depending only on the components $p_n^k(\alpha_k)$ of the mixed strategy vector corresponding to the optimal actions $u^k(\alpha_k)$. The conditional mathematical expectation
of this function at time $(n + 1)$ can be expressed as


1 — IT: pk, (ox) 
E (Wanhaa) =e Iia E I5, 
Iii P5410 


eD pel | 


k 
i=l iy=l II Pr+1 (2k) 


Fn Auk = uk (i), k = 1, ren TI. oo (4.89) 
and


2 N 
$2» DOT] PAGE luris (4.91) 
i=} —iy-lk-l Mi= pk, (ak) 
ATN] 
E [TIL pkp (ou) Fn Auk = GER LN| 
Fn Auk = u* (iy) k= zi (4.92) 
The $\sigma$-algebra $\mathcal{F}_n$ is defined as
Fri=o (p. ut wsss —-lhnn-lLkz LN). 
Using the Bush-Mosteller procedure (4.83), it can be shown that 
N N 
[rnes [I (pio + yn [eub — pk +E € - etuk) | ) 
k=l k=1 k 


= (1 — Nyn) I PE (ar) + yn ST pilas) [eD], 


k=1 k=] s#k 


+8, (1 - 2e] | + o(2). 


Analogously to [91] (see Theorem 3 in Section 3.4.1), the second term s, (4.91) in 
(4.90) can be estimated in the following way: 


2 2 N 
=a a [YnlFn Auk = vt io. k= TN) TT] phi. 


k=} 




Let us split s, into two terms (the term related to the optimal actions and the term 
associated with the nonoptimal actions): 


N 
E | Yn|Fn Auk = v Go. E S 1,8] TT] rh) 
k=1 


i=l 


Ew 


N 
E [riz Auk = ur ig) km I.N) [I tao 


E 


kzl 
N 
+E [v.e Auk = uta), km LA) T] phen), (4.93) 
k=l 
where 
EIE Phas eo/a uh mto FETT) - T Pha 
n — — i 


Mii Pk 0DE [TIa Phy (0/7s Auk = uk (i), E 21,8) 


The estimation 


N 
ERL 1 
E Iris. Auk = ut (a), k = 1, y) Il PX (ax) = ow (zv) 


k=1 

implies 

sn « n X0-NPC, 1 + o(1) Wns (4.94) 
where 

oes (Ny)?Cp 
UU (a= Nyyhr 
and 
: . . 2 1 
c, =E | (Ey - Ais): | uk = u* (iy), k=1, |: = (<=) 
(4.95) 


(using Lemma 14). Substituting (4.95) into (4.94) implies 


Sa € o(n NY) W, (4.96) 
Calculating the first term in (4.90) in a similar way as in [91] (see Theorem 3 in
Section 3.4.1), we derive for $n \ge n_0(\omega)$, $n_0(\omega) \overset{a.s.}{<} \infty$:


aE | 


E (TI | PE a QD | Fs A uk = uk (ik), k =], KzN| 


N 
=| [] pia (4.97) 
1 
zorl 


iy Aa inzan (E [T PE a (Qu) | Fn ^ uk = ut (ig), k = LN] 


N 
4l |o (4.98) 
k=1 


1 
Dolan te E 52 se |J Toto 
E {TM 1 PE a QUO |n ^ uk = ukt (ag), k = 1, k=1N} 


1 
< Wp (1 Š an) 4 o(n 3/282NY. (4.99) 
where 
Ny ^* 
LET ERN. 


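Substitution of (4.96) and (4.99) in (4.90) leads to the following inequality:

$$E\{W_{n+1} \mid \mathcal{F}_n\} \overset{a.s.}{\le} W_n\left(1 - \frac{c_2}{n} + o\left(\frac{1}{n}\right)\right) + o\left(n^{-3/2+2N\gamma}\right)$$

$$= W_n\left(1 - \frac{c_2[1 + o(1)]}{n}\right) + o\left(n^{-3/2+2N\gamma}\right). \tag{4.100}$$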


In view of the Robbins-Siegmund Theorem (Theorem 14 in Appendix A), it
follows that

$$W_n \xrightarrow[n\to\infty]{a.s.} 0, \qquad \prod_{k=1}^{N} p_n^k(\alpha_k) \xrightarrow[n\to\infty]{a.s.} 1.$$

The theorem is proven.
This theorem shows that a team of binary learning automata using the
Bush-Mosteller reinforcement scheme with the normalization procedure described
above selects asymptotically the optimal actions. The convergence of a recursive
scheme is important, but the convergence speed is also essential. It depends on
the number of operations performed by the algorithm during an iteration as well
as the number of iterations needed for convergence. The next corollary gives an
estimate of the rate of convergence of the optimization algorithm described above.


Corollary 3 (On convergence rate) Under the assumptions of this theorem it 
follows that 


where

$$0 < \nu < \tfrac{1}{2} - 2N\gamma.$$

So, for a given $\gamma \in (0, a/N)$, the order $\nu$ of the learning rate $n^{-\nu}$ decreases if the
number $N$ of automata in the team increases.
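The trade-off between the team size and the admissible rate order can be tabulated directly. The sketch below assumes that the bound of the corollary reads $0 < \nu < 1/2 - 2N\gamma$, as suggested by the condition $3/2 - 2N\gamma - \nu > 1$ used in its proof; the value $\gamma = 0.02$ is an arbitrary illustration.

```python
def max_rate_order(N, gamma):
    """Supremum of the admissible order nu in 0 < nu < 1/2 - 2*N*gamma.

    Returns 0.0 when the interval is empty, i.e. when the bound
    gives no polynomial rate for this pair (N, gamma).
    """
    return max(0.0, 0.5 - 2.0 * N * gamma)

# Upper limit of nu for teams of N = 2, ..., 8 automata with gamma = 0.02.
orders = {N: max_rate_order(N, 0.02) for N in range(2, 9)}
```

With this choice the upper limit falls from 0.42 for $N = 2$ to 0.18 for $N = 8$: a finer discretization of the search space is paid for by a slower guaranteed rate.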


The proof is obtained using Lemma A.3-2 in [91]. 


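Proof In view of Lemma A.3-2 in [91] for

$$\nu_n := n^{\nu}, \qquad u_n := W_n, \qquad \alpha_n := \frac{c_2[1 + o(1)]}{n}, \qquad \beta_n := o\left(n^{-3/2+2N\gamma}\right)$$

from (4.100), we obtain

$$\tfrac{3}{2} - 2N\gamma - \nu > 1.$$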


The corollary is proved. 


4.5.4. Numerical Example 


In order to illustrate the feasibility and the performance of the algorithm presented earlier, let us consider the following optimization problem (minimization of
a multimodal function):

$$f(x) \to \min_{x}, \tag{4.101}$$

where

$$f(x) = \cos x + \sin\frac{x}{2} + \frac{1}{2}\sin\frac{x}{4} \tag{4.102}$$

and $w \sim N(0, 0.1)$. $x$ is obtained from (4.60). The task is then to find the optimal
combination of the $u^k$, $k = 1, 2, \ldots, N$. Figure 4.5 shows an example of the data for
$N = 4$, $n = 100$. This setting results in a resolution of $\Delta = 1.2$, corresponding to
the search space of a single automaton with 16 actions. The minimum is found at
$x^* = 9.9$ ($u^1 = u^2 = u^3 = 1$, $u^4 = 0$).

Figure 4.5 Multimodal function. The solid line shows the true function f; noisy data are
evaluated at points obtained using N = 4. (Horizontal axis: $x_n$, grid points 1.5, 2.7, ..., 19.5.)
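The reported optimum can be cross-checked by exhaustive search over the 16 grid points. The sketch below rests on two assumptions that are not confirmed by the text: the test function is read as $f(x) = \cos x + \sin(x/2) + \tfrac{1}{2}\sin(x/4)$, and the coding (4.60) is taken to be $x = 1.5 + 1.2\sum_k u^k 2^{k-1}$, which matches the grid 1.5, 2.7, ..., 19.5 and the pairing of $x^* = 9.9$ with $u = (1, 1, 1, 0)$.

```python
import math
from itertools import product

def f(x):
    # Assumed reading of (4.102); the printed formula is partly illegible.
    return math.cos(x) + math.sin(x / 2) + 0.5 * math.sin(x / 4)

def decode(u, x_min=1.5, delta=1.2):
    # Assumed binary coding: u[0] is the least significant bit.
    return x_min + delta * sum(bit << k for k, bit in enumerate(u))

N = 4
# Noise-free exhaustive search over all 2**N joint actions.
best_u = min(product((0, 1), repeat=N), key=lambda u: f(decode(u)))
best_x = decode(best_u)
```

The team of automata has to find the same point from noisy observations only, without ever evaluating all 16 candidates side by side.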


Figure 4.6 (gray dots) shows the probabilities $p_n^k(1)$ in a typical simulation with
$N = 4$, $\gamma = 1/(2N)$, $a = 1$, $t_0 = 10000N$, $T = t_0 + 10000$ and

$$\gamma_t = \begin{cases} \gamma, & t < t_0, \\ \dfrac{\gamma}{(t - t_0) + a}, & t \in [t_0, T]. \end{cases} \tag{4.103}$$

It can be seen that for the first 30 000 iterations no apparent learning takes place.
Taking the running sample means (solid lines) shows, however, that after a few
hundred iterations the mean probabilities $(1/n)\sum_{t=1}^{n} p_t^k(1)$ already remain larger
than 0.5 (for $k = 1, 2, 3$) and less than 0.5 (for $k = 4$). At iterations $t \approx 31\,275$ to
$31\,300$, the probabilities suddenly converge to 1's and 0's, and remain there until
the iterations are terminated.

Figure 4.6 Evolution of the probabilities $p_n^k(1)$

In order to have a better view of the practical viability of the approach, 50 repeated
simulations were conducted for $N = 2, 3, \ldots, 8$ using a constant $\gamma_n = \gamma = 0.25$.
Simulations were stopped when the probability for selecting a single action in the
discretized search space was 99.99%, i.e., $\prod_{k=1}^{N} \max_i p_n^k(i) > 0.9999$. The
results are summarized in Figures 4.7 and 4.8. Figure 4.7 shows the number of
times a particular solution was obtained (out of the 50 test runs). For N = 3, 4 and
5, all 50 simulations converged to the optimum $x^*$. For N = 6, 7 and 8, the correct
optimum was found in 47, 19 and 12 test runs, respectively, while the other solutions
were found in the close neighborhood of $x^*$. Figure 4.8 shows the histogram of
the distribution of the number of iteration rounds required until convergence. It
is clear that the precision increases with the number of bits considered for real
number coding. However, the number of required iterations also increases with N,
as expected.


4.5.5. Conclusions 


This study demonstrated a potentially powerful tool for optimization purposes.
The approach adopted here was based on a team of learning stochastic automata
with a continuous input (environment response) and binary outputs. A modified
version of the Bush-Mosteller reinforcement scheme was used for the adjustment
of the probability distributions. The asymptotic properties of this random search
optimization technique were stated on the basis of the Lyapunov approach and
martingale theory. This analysis shows again the power of the Robbins-Siegmund
Theorem. A numerical example was presented in order to illustrate the feasibility
and the performance of the optimization algorithm.

Figure 4.7 Frequencies of the solutions found (panel title: solution vs. # choices)


4.6. Convergence Rate 


The analysis of the asymptotic properties (convergence and convergence rate) of
a given recursive algorithm is of great importance. Convergence indicates whether
the considered algorithm can be implemented at all, and the convergence rate
gives information about the speed of its convergence. This speed depends on the
number of operations performed by the algorithm during an iteration as well as on
the number of iterations needed for convergence.
In this section we shall present in detail a method related to the estimation of
the convergence rate of a stochastic approximation technique and show how this
method can be applied to estimating the convergence rate of more complex iterative
algorithms [81, 86, 87]. We shall first present an iterative algorithm, and analyze
its convergence before estimating its convergence rate.

Figure 4.8 Distribution of the number of iteration rounds (panel title: Distribution of # iterations; horizontal axis: logarithm of the number of iterations)


Let us consider the following unconstrained optimization problem

$$f(x) \to \min_{x \in R^N}. \tag{4.104}$$

We shall deal with the case where the observations (maybe corrupted by noise $\xi_n$)
of the gradients $Y_n$

$$Y_n = \nabla f(x_{n-1}) + \xi_n, \qquad n = 0, 1, 2, \ldots \tag{4.105}$$

for the points $x_{n-1}$ are available.

To solve the optimization problem (4.104), we shall use the stochastic approximation technique [26, 47], which leads to the following recursion:

$$x_n = x_{n-1} - \Gamma_n Y_n, \tag{4.106}$$

where $\Gamma_n \in R^{N \times N}$ represents the gain matrix.

To simplify the presentation of the analysis method, we shall assume that:

[H1]: The noises $\xi_n$ that are related to the observation (realization) of the gradient
(4.105) are centered, have a bounded second moment (power) and are independent,
i.e.,

$$E[\xi_n] = 0, \qquad E\left[\|\xi_n\|^2\right] = \sigma_n^2 \le \sigma^2, \tag{4.107}$$

$$E[\xi_n \varphi_n \mid \mathcal{F}_{n-1}] = 0 \text{ for any } \varphi_n \text{ that is } \mathcal{F}_{n-1}\text{-measurable},$$

where $\mathcal{F}_{n-1} = \sigma(x_0, \xi_1, \ldots, \xi_{n-1})$.

[H2]: The optimized function is strictly convex, i.e.,

$$(x - x^*)^T \nabla f(x) \ge \rho\,\|x - x^*\|^2, \qquad \rho > 0, \tag{4.108}$$

its gradient $\nabla f(x)$ satisfies the Lipschitz condition

$$\|\nabla f(x) - \nabla f(x')\| \le L_{\nabla}\,\|x - x'\| \qquad \forall x, x' \in R^N \tag{4.109}$$

and this function is twice differentiable at the minimal point $x^*$ (the optimal solution
of the optimization problem (4.104)), such that

$$\nabla^2 f(x^*) > 0. \tag{4.110}$$
[H3]: The gain matrix $\Gamma_n$ has the following structure$^{12}$:

$$\Gamma_n = \gamma_n \Gamma, \qquad 0 < \Gamma = \Gamma^T \in R^{N \times N}, \qquad 0 < \gamma_n \in R.$$

We are now ready to prove the first main result, namely the convergence to the
minimal point of the algorithm defined by the procedure (4.106).


4.6.1. Convergence with Probability 1


The following theorem states the convergence of the previous iterative
procedure (4.106).


Theorem 9 If assumptions H1, H2 and H3 are fulfilled, and if in addition

$$\sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \sum_{n=1}^{\infty} \gamma_n^2 < \infty \tag{4.111}$$

then

$$x_n \xrightarrow[n\to\infty]{a.s.} x^*.$$

12 A matrix A is said to be strictly positive definite if $x^T A x > 0$ for all $x \ne 0$;
this is denoted by $A > 0$. A real symmetric matrix is strictly positive definite if and only if all its
eigenvalues are positive.
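The recursion (4.106) with $\Gamma_n = \gamma_n \Gamma$ is easy to simulate in the scalar case. The sketch below is an illustration only: the quadratic test function, the gain value, the noise level and the choice $\gamma_n = 1/n$ (which satisfies (4.111)) are assumptions made for the example, and all names are hypothetical.

```python
import random

def stochastic_approximation(grad, x0, gain, n_iter, sigma, rng):
    """Scalar version of x_n = x_{n-1} - (gain / n) * Y_n,
    where Y_n = grad(x_{n-1}) + noise is the noisy gradient observation."""
    x = x0
    for n in range(1, n_iter + 1):
        y = grad(x) + rng.gauss(0.0, sigma)  # observation (4.105)
        x -= (gain / n) * y                  # step gamma_n = 1/n, Gamma = gain
    return x

# Illustrative problem: f(x) = (x - 1)^2, so grad f(x) = 2 (x - 1) and x* = 1.
x_hat = stochastic_approximation(lambda x: 2.0 * (x - 1.0), x0=5.0,
                                 gain=0.5, n_iter=20000, sigma=1.0,
                                 rng=random.Random(1))
```

A constant step would leave the iterate fluctuating around $x^*$ with a variance that does not vanish; the decreasing steps required by (4.111) average the noise out while still allowing the iterate to travel any distance.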
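Proof Let us introduce the following Lyapunov function:

$$V(x) := (x - x^*)^T \Gamma^{-1} (x - x^*) = \|x - x^*\|_{\Gamma^{-1}}^2. \tag{4.112}$$

Substituting (4.106) into (4.112), we derive

$$V(x_n) = (x_{n-1} - \gamma_n \Gamma Y_n - x^*)^T \Gamma^{-1} (x_{n-1} - \gamma_n \Gamma Y_n - x^*)$$

$$= (x_{n-1} - x^*)^T \Gamma^{-1} (x_{n-1} - x^*) - 2\gamma_n (x_{n-1} - x^*)^T \Gamma^{-1}\Gamma Y_n + \gamma_n^2 \|\Gamma Y_n\|_{\Gamma^{-1}}^2.$$

Observe that

$$\|\Gamma Y_n\|_{\Gamma^{-1}}^2 = Y_n^T \Gamma \Gamma^{-1} \Gamma Y_n = Y_n^T \Gamma Y_n \le \|\Gamma\|\,\|Y_n\|^2 = \|\Gamma\|\,\|\nabla f(x_{n-1}) + \xi_n\|^2 \le 2\|\Gamma\|\left(\|\nabla f(x_{n-1})\|^2 + \|\xi_n\|^2\right).$$

In the derivation of this estimation, we have used the fact that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$. Taking into account this estimation, it follows that

$$V(x_n) \le V(x_{n-1}) - 2\gamma_n (x_{n-1} - x^*)^T \nabla f(x_{n-1}) - 2\gamma_n (x_{n-1} - x^*)^T \xi_n + 2\gamma_n^2 \|\Gamma\|\left(\|\nabla f(x_{n-1})\|^2 + \|\xi_n\|^2\right). \tag{4.113}$$

Based on (4.108) and (4.109), it follows ($x := x_{n-1}$, $x' := x^*$)

$$\nabla f(x^*) = 0,$$

$$\|\nabla f(x_{n-1})\|^2 \le L_{\nabla}^2 \|x_{n-1} - x^*\|^2 = L_{\nabla}^2 (x_{n-1} - x^*)^T \Gamma^{-1/2}\Gamma\,\Gamma^{-1/2} (x_{n-1} - x^*) \le L_{\nabla}^2 \|\Gamma\|\, V(x_{n-1}) \tag{4.114}$$

and based on assumption H2 (4.108), we derive

$$(x_{n-1} - x^*)^T \nabla f(x_{n-1}) \ge \rho\,\|x_{n-1} - x^*\|^2 = \rho\,(x_{n-1} - x^*)^T (x_{n-1} - x^*)$$

$$= \rho\,(x_{n-1} - x^*)^T \left[\Gamma^{-1/2}\Gamma\,\Gamma^{-1/2}\right] (x_{n-1} - x^*) \ge \rho\,\lambda_{\min}(\Gamma)\,\|x_{n-1} - x^*\|_{\Gamma^{-1}}^2 = \rho\,\lambda_{\min}(\Gamma)\,V(x_{n-1}), \tag{4.115}$$

where $\lambda_{\min}(\Gamma)$ represents the minimal eigenvalue of the matrix $\Gamma$. In following
this algebra, it is useful to know that in deriving the second equality of (4.115), we
expressed the identity matrix as $I = \Gamma\Gamma^{-1} = \Gamma\,\Gamma^{-1/2}\Gamma^{-1/2} = \Gamma^{-1/2}\Gamma\,\Gamma^{-1/2}$ ($\Gamma$ was
assumed to be symmetric, i.e., $\Gamma = \Gamma^T$).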

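The direct substitution of (4.114)-(4.115) into (4.113) leads to the following
expression:

$$V(x_n) \le V(x_{n-1}) - 2\rho\gamma_n \lambda_{\min}(\Gamma) V(x_{n-1}) + 2\gamma_n^2 L_{\nabla}^2 \|\Gamma\|^2 V(x_{n-1}) - 2\gamma_n (x_{n-1} - x^*)^T \xi_n + 2\gamma_n^2 \|\Gamma\|\,\|\xi_n\|^2. \tag{4.116}$$

In view of assumption H1 (4.107), we derive the following estimation, which is
mainly based on the assumptions that $E[\xi_n] = 0$ and $E\left[\|\xi_n\|^2\right] = \sigma_n^2 \le \sigma^2$:

$$E[V(x_n) \mid \mathcal{F}_{n-1}] \overset{a.s.}{\le} V(x_{n-1})\left[1 - 2\gamma_n\left(\rho\lambda_{\min}(\Gamma) - \gamma_n L_{\nabla}^2 \|\Gamma\|^2\right)\right] + 2\gamma_n^2 \|\Gamma\|\,\sigma^2. \tag{4.117}$$

Finally, in view of the Robbins-Siegmund Theorem [93, 100] for

$$W_n = V(x_n), \qquad \alpha_n = 0, \qquad \beta_n = 2\gamma_n^2 \|\Gamma\|\,\sigma^2$$

and

$$\eta_n = 2\gamma_n\left(\rho\lambda_{\min}(\Gamma) - \gamma_n L_{\nabla}^2 \|\Gamma\|^2\right) V(x_{n-1})$$

(notice that $\eta_n \ge 0$ if $\gamma_n \le \rho\lambda_{\min}(\Gamma)/(L_{\nabla}^2 \|\Gamma\|^2)$, which is always true after some
finite $n_0$ since $\sum \gamma_n^2 < \infty$), we derive

$$\sum_{n=1}^{\infty} \gamma_n V(x_{n-1}) \overset{a.s.}{<} \infty, \qquad V(x_n) \xrightarrow[n\to\infty]{a.s.} V^*. \tag{4.118}$$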


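Based on (4.111), since $\sum_{n=1}^{\infty} \gamma_n = \infty$ we conclude that there exists a subsequence
$n_k$ such that

$$V(x_{n_k - 1}) \xrightarrow[k\to\infty]{} 0.$$

Indeed, assuming that $\lim_{n\to\infty} V(x_{n-1}) = \kappa > 0$, we have $V(x_{n-1}) \ge \kappa/2$ for all $n \ge n_0$ with some finite $n_0$, and then

$$\sum_{n=1}^{\infty} \gamma_n V(x_{n-1}) \ge \frac{\kappa}{2} \sum_{n=n_0}^{\infty} \gamma_n = \infty,$$

which is in contradiction to (4.118). But $V(x_{n-1})$ converges and all partial limits
coincide. Hence, $V^* = 0$ and $x_n \xrightarrow[n\to\infty]{a.s.} x^*$.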

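Now that the convergence of the recursive algorithm (4.106) has been established,
we shall next be concerned with the mean squares convergence, which is established
via the next corollary.


Corollary 4 (On mean squares convergence) If

$$\sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \gamma_n \to 0$$

then

$$E[V(x_n)] \xrightarrow[n\to\infty]{} 0.$$

Proof In view of Lemma 17 (see Appendix A), the statement of this corollary
follows directly from (4.117) for

$$\alpha_n = 2\gamma_n\left(\rho\lambda_{\min}(\Gamma) - \gamma_n L_{\nabla}^2 \|\Gamma\|^2\right), \qquad \beta_n = 2\gamma_n^2 \|\Gamma\|\,\sigma^2,$$

which implies

$$\limsup_{n\to\infty} E[V(x_n)] \le \lim_{n\to\infty} \frac{\beta_n}{\alpha_n} = \lim_{n\to\infty} \frac{2\gamma_n^2 \|\Gamma\|\,\sigma^2}{2\gamma_n\left(\rho\lambda_{\min}(\Gamma) - \gamma_n L_{\nabla}^2 \|\Gamma\|^2\right)} = 0.$$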

4.6.2. Normalized Deviation 


The next developments deal with the estimation of the convergence rate. Starting
from this point, we may assume that $x_n \xrightarrow[n\to\infty]{a.s.} x^*$, and hence

$$\nabla f(x_{n-1}) = \nabla^2 f(x^*)(x_{n-1} - x^*) + o(\|x_{n-1} - x^*\|). \tag{4.119}$$

So, in the neighborhood of the optimal point $x^*$ the algorithm has the
following form:

$$x_n = x_{n-1} - \gamma_n \Gamma \nabla^2 f(x^*)(x_{n-1} - x^*) - \gamma_n \Gamma \xi_n + \gamma_n\, o(\|x_{n-1} - x^*\|). \tag{4.120}$$

This algorithm can be presented in the following form:

$$\Delta_n = \left[I - \gamma_n \Gamma \nabla^2 f(x^*)\right]\Delta_{n-1} - \gamma_n \Gamma \xi_n + \gamma_n\, o(\|x_{n-1} - x^*\|),$$

where

$$\Delta_n := x_n - x^*$$

characterizes the closeness of the estimation $x_n$ obtained at the nth iteration (step)
to the minimal point $x^*$, and $I$ represents the unit matrix. Multiplying both sides of
(4.120) by $\gamma_n^{-1/2}$, we obtain

$$\gamma_n^{-1/2}\Delta_n = \left[I - \gamma_n \Gamma \nabla^2 f(x^*)\right]\sqrt{\frac{\gamma_{n-1}}{\gamma_n}}\;\gamma_{n-1}^{-1/2}\Delta_{n-1} - \gamma_n^{1/2}\Gamma \xi_n + \gamma_n^{1/2}\, o(\|x_{n-1} - x^*\|).$$

If we introduce the normalized value

$$w_n := \gamma_n^{-1/2}\Delta_n, \tag{4.121}$$

the previous equation leads to

$$w_n = \left[I - \gamma_n \Gamma \nabla^2 f(x^*)\right]\sqrt{\frac{\gamma_{n-1}}{\gamma_n}}\;w_{n-1} - \gamma_n^{1/2}\Gamma \xi_n + \gamma_n^{1/2}\, o(\|\Delta_{n-1}\|). \tag{4.122}$$


H4: Let us now assume that the noise is stationary (the covariance is constant), i.e.,

$$E\left[\xi_n \xi_n^T\right] = \Xi, \qquad \Xi \text{ is a constant matrix} \tag{4.123}$$

and

H5: The correction factor is selected as follows:

$$\gamma_n = n^{-\gamma}, \qquad \tfrac{1}{2} < \gamma \le 1, \qquad n = 1, 2, \ldots. \tag{4.124}$$

Notice that with this selection, $\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\sum_{n=1}^{\infty} \gamma_n^2 < \infty$. Based on this assumption, the
normalized deviation $w_n$ is given by

$$w_n = n^{\gamma/2}\Delta_n \tag{4.125}$$

and we obtain$^{13}$


(4.126) 


Now we shall be concerned with the rate of convergence in the mean squares 
sense. 


4.6.3. Rate of Mean Squares Convergence 


Before estimating the rate of convergence in the mean squares sense, let us first
introduce a definition.


Definition 40 If there exists a limit

$$\lim_{n\to\infty} E\left[w_n w_n^T\right] = R \tag{4.127}$$

we may associate R with the rate of convergence of the iterative procedure (4.106).
Indeed, in view of (4.127), we have

$$E\left[\Delta_n \Delta_n^T\right] = \frac{R}{n^{\gamma}} + o\left(\frac{1}{n^{\gamma}}\right). \tag{4.128}$$

From (4.124), it is clear that the maximal order $\gamma$ of convergence is achieved for
$\gamma = 1$. In what follows, we shall consider this optimal order, which corresponds to

$$\gamma_n = \frac{1}{n}, \qquad w_n = \sqrt{n}\,\Delta_n, \qquad \sqrt{\frac{\gamma_{n-1}}{\gamma_n}} = 1 + \frac{1}{2n} + o\left(\frac{1}{n}\right). \tag{4.129}$$

The next theorem states the rate of convergence.


Theorem 10 (On the rate of mean squares convergence) If assumptions
H1-H5 hold and, in addition,


Elo An] — 0, 
noo 


13 Observe that 1/(n — 1) — (1/n) = 1/(n(n — 1)) = o(1/n). 
then the matrix R corresponding to the rate of convergence of the SAT procedure
(4.106) satisfies the following matrix Lyapunov equation:

$$RA^T + AR = -\Gamma \Xi \Gamma^T, \qquad A = \frac{I}{2} - \Gamma \nabla^2 f(x^*),$$

where the matrix A is assumed to be stable [72].


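Proof From (4.122) and (4.129), and taking into account assumption H1, we derive

$$E\left[w_n w_n^T\right] = \left(1 + \frac{1}{n} + o\left(\frac{1}{n}\right)\right)\left[I - \frac{1}{n}\Gamma \nabla^2 f(x^*)\right] E\left[w_{n-1} w_{n-1}^T\right]\left[I - \frac{1}{n}\Gamma \nabla^2 f(x^*)\right]^T$$

$$+ \frac{1}{n}\left[\Gamma\, E\left[\xi_n \xi_n^T\right]\Gamma^T + E\, o\left(\|\Delta_{n-1}\|^2\right)\right] + O\left(n^{-1/2}\right)\left[I - \frac{1}{n}\Gamma \nabla^2 f(x^*)\right] E\left[w_{n-1}\, o(\|\Delta_{n-1}\|)\right].$$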


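Taking into account that $E\left[\xi_n \xi_n^T\right] = \Xi$, and

$$\left(1 + \frac{1}{2n} + o\left(\frac{1}{n}\right)\right)\left[I - \frac{1}{n}\Gamma \nabla^2 f(x^*)\right] = I + \frac{1}{n}A + o\left(\frac{1}{n}\right),$$

where

$$A = \frac{I}{2} - \Gamma \nabla^2 f(x^*), \tag{4.130}$$

it follows that

$$E\left[w_n w_n^T\right] = \left[I + \frac{1}{n}A\right] E\left[w_{n-1} w_{n-1}^T\right]\left[I + \frac{1}{n}A\right]^T + \frac{1}{n}\Gamma \Xi \Gamma^T + o\left(\frac{1}{n}\right). \tag{4.131}$$

In view of Lemma 18 (see Appendix A), we obtain the desired result.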
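For intuition, the Lyapunov equation can be solved in closed form in the scalar case. The following worked example is only a sketch under assumed notation that is not in the text: $N = 1$, gain $\Gamma = g > 0$, curvature $\nabla^2 f(x^*) = h > 0$ and noise variance $\Xi = \sigma^2$.

```latex
A = \tfrac{1}{2} - gh, \qquad A \text{ is stable} \iff gh > \tfrac{1}{2},
\\[4pt]
RA + AR = 2\left(\tfrac{1}{2} - gh\right) R = -g^2 \sigma^2
\quad\Longrightarrow\quad
R = \frac{g^2 \sigma^2}{2gh - 1}.
```

For instance, $g = 1/h$ gives $R = \sigma^2/h^2$, while $R$ grows without bound as $gh$ decreases toward $1/2$, which shows why the stability of $A$ is required.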


4.6.4. Asymptotic Normality and Rate of Convergence in Distribution Sense 


We shall give a theorem the proof of which is based on Sacks' Theorem 17 [102] (see
Appendix A).

Theorem 11 (On asymptotic normality) Suppose the assumptions H1-H5 hold,
and in addition it is assumed that:

1. the fourth moment is finite, i.e.,

$$E\left[\|\xi_n\|^4\right] \le \text{const} < \infty,$$

2. the matrix

$$A = \frac{I}{2} - \Gamma \nabla^2 f(x^*)$$

is stable$^{14}$.

Then, asymptotically as $n \to \infty$,

$$\sqrt{n}\,(x_n - x^*) \sim N(0, R),$$

where the symmetric matrix R is defined as the unique solution to the following
Lyapunov equation:

$$AR + RA^T = -\Gamma \Xi \Gamma^T.$$

Comparing this result with the mean squares convergence, one can see that practically under the same conditions (only a finite fourth moment of the noise is
additionally required) we may guarantee the property of asymptotic normality for the normalized
deviation $\sqrt{n}\,(x_n - x^*)$ with the same convergence rate R.


14 A matrix A is stable if and only if the real parts of all its eigenvalues are negative. Observe
that the solution of the system

$$\frac{d}{dt}x(t) = Ax(t), \qquad x_0 = x(0)$$

is asymptotically stable if for any symmetric positive definite matrix Q there exists
a symmetric positive definite matrix P such that

$$PA + A^T P = -Q.$$

This result can be proved by considering the following Lyapunov candidate:

$$V(t, x) = x^T(t) P x(t),$$

where P is a symmetric positive definite matrix. The first derivative of the Lyapunov function
is given by

$$\frac{d}{dt}V(t, x) = \left(\dot{x}(t)\right)^T P x(t) + x^T(t) P \dot{x}(t) = x^T(t)\left[PA + A^T P\right] x(t).$$

If there exists a symmetric positive definite matrix Q such that

$$PA + A^T P = -Q$$

then the Lyapunov stability condition is fulfilled.


4.6.5. Rate of Almost Sure Convergence


The rates of convergence in the mean squares sense and in the weak sense (asymptotic
normality of normalized deviations) characterize the so-called collective behavior: they
apply to a group of possible realizations (trajectories) of the
recurrent process. The next claim deals with the rate of convergence in an individual
trajectory sense, i.e., this is the rate of practically all (with probability 1) possible
trajectories and it has the expression

$$\|x_n(\omega) - x^*\| \sim \sqrt{\frac{\ln(\ln n)}{n}} \qquad \text{for almost all } \omega \in \Omega.$$

This property is based on the law of the iterated logarithm (see Appendix A) and it may
be expressed more precisely as follows (see [84]).


Theorem 12 (On rate of almost sure convergence) If assumptions
H1-H5 hold, and in addition the noises have a bounded moment of order greater than 2,
that is, there exists $\beta > 2$ such that

$$E\left[\|\xi_n\|^{\beta}\right] \le \text{const} < \infty,$$

then the following property holds:

$$\limsup_{n\to\infty} \sqrt{\frac{n}{\ln(\ln n)}}\,(w, x_n - x^*) = -\liminf_{n\to\infty} \sqrt{\frac{n}{\ln(\ln n)}}\,(w, x_n - x^*) \overset{a.s.}{=} \sqrt{2 w^T R w},$$

where R is the solution of the Lyapunov matrix equation

$$RA^T + AR = -\Gamma \Xi \Gamma^T, \qquad A = \frac{I}{2} - \Gamma \nabla^2 f(x^*)$$

and the vector w represents an eigenvector of the matrix $\Gamma \nabla^2 f(x^*)$. The matrix
A is assumed to be stable.


Here as before, the rate of the almost sure convergence may be associated with the
matrix R, and the order of this type of convergence is $\sqrt{n / \ln(\ln n)}$.

4.6.6. Optimization of the Convergence Rate on the Basis of 
the Gain Matrix Selection 


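As was shown in the previous subsections, the rate of convergence (in any of the
discussed senses) is given by a matrix R satisfying the Lyapunov matrix equation

$$RA^T + AR = -\Gamma \Xi \Gamma^T \tag{4.132}$$

with the stable matrix

$$A = \frac{I}{2} - \Gamma \nabla^2 f(x^*). \tag{4.133}$$

Let us denote this matrix of convergence rate by

$$R = R(\Gamma). \tag{4.134}$$

Let us now try to optimize this rate of convergence by a special selection of the
gain matrix $\Gamma$, that is, try to solve the following optimization problem:

$$R(\Gamma) \to \min_{0 < \Gamma = \Gamma^T},$$

keeping in mind the stability property of the matrix A (4.133). This operation of
minimization is to be understood in the matrix sense, that is, if $\Gamma = \Gamma^*$ is a solution
of the considered problem, then the following inequality:

$$R(\Gamma) - R(\Gamma^*) \ge 0 \tag{4.135}$$

means that for any vector $z \in R^N$,

$$z^T\left(R(\Gamma) - R(\Gamma^*)\right)z \ge 0.$$

This matrix optimization (minimization) problem can be easily solved [90]. Let us
define $\Gamma^*$

$$\Gamma^* := \left[\nabla^2 f(x^*)\right]^{-1} \tag{4.136}$$

and show that $\Gamma^*$ is optimal. The direct substitution of (4.136) in (4.132) gives

$$R\left[\frac{I}{2} - \left[\nabla^2 f(x^*)\right]^{-1}\nabla^2 f(x^*)\right]^T + \left[\frac{I}{2} - \left[\nabla^2 f(x^*)\right]^{-1}\nabla^2 f(x^*)\right]R = -\left[\nabla^2 f(x^*)\right]^{-1}\Xi\left[\nabla^2 f(x^*)\right]^{-1},$$

which leads to

$$R(\Gamma^*) = \left[\nabla^2 f(x^*)\right]^{-1}\Xi\left[\nabla^2 f(x^*)\right]^{-1}. \tag{4.137}$$

Let us consider the following operator $T_{A(\Gamma)}$:

$$T_{A(\Gamma)}R := -\left[RA^T(\Gamma) + A(\Gamma)R\right]. \tag{4.138}$$

Notice that if $T_{A(\Gamma)}R \ge 0$, then by the Lyapunov Lemma 19 (see Appendix A),
$R \ge 0$.

Using definition (4.138), and in view of (4.132), we obtain

$$T_{A(\Gamma)}\left[R(\Gamma) - R(\Gamma^*)\right] = T_{A(\Gamma)}R(\Gamma) - T_{A(\Gamma)}R(\Gamma^*)$$

$$= -\underbrace{\left[R(\Gamma)A^T(\Gamma) + A(\Gamma)R(\Gamma)\right]}_{-\Gamma \Xi \Gamma^T} + \left[R(\Gamma^*)A^T(\Gamma) + A(\Gamma)R(\Gamma^*)\right]$$

$$= \Gamma \Xi \Gamma^T + \left[R(\Gamma^*)A^T(\Gamma) + A(\Gamma)R(\Gamma^*)\right]. \tag{4.139}$$

Let us consider the following term:

$$\Gamma \Xi \Gamma^T = \Gamma\left[\nabla^2 f(x^*)\right]\left[\nabla^2 f(x^*)\right]^{-1}\Xi\left[\nabla^2 f(x^*)\right]^{-1}\left[\nabla^2 f(x^*)\right]\Gamma^T.$$

But by (4.137) and (4.133), we obtain

$$\Gamma \Xi \Gamma^T = \Gamma\left[\nabla^2 f(x^*)\right] R(\Gamma^*)\left[\nabla^2 f(x^*)\right]\Gamma^T = \left(\frac{I}{2} - A(\Gamma)\right) R(\Gamma^*)\left(\frac{I}{2} - A(\Gamma)\right)^T. \tag{4.140}$$

The direct substitution of (4.140) into (4.139) implies

$$T_{A(\Gamma)}\left[R(\Gamma) - R(\Gamma^*)\right] = \left(\frac{I}{2} - A(\Gamma)\right) R(\Gamma^*)\left(\frac{I}{2} - A(\Gamma)\right)^T + \left[R(\Gamma^*)A^T(\Gamma) + A(\Gamma)R(\Gamma^*)\right]$$

$$= \left(\frac{I}{2} + A(\Gamma)\right) R(\Gamma^*)\left(\frac{I}{2} + A(\Gamma)\right)^T \ge 0.$$

So (4.135) is proven. In view of this, we may conclude that the best gain matrix
(optimal from the point of view of the convergence rate) for the optimization procedure
(4.106) is as follows:

$$\Gamma_n^* = \frac{1}{n}\left[\nabla^2 f(x^*)\right]^{-1}. \tag{4.141}$$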
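The optimality of $\Gamma^* = [\nabla^2 f(x^*)]^{-1}$ is easy to see in the scalar case, where the Lyapunov equation (4.132) has a closed-form solution. The sketch below assumes scalar stand-ins that are not in the text: gain `g`, curvature `h` and noise variance `sigma2`, so that $A = 1/2 - gh$ and $R(g) = g^2\sigma^2/(2gh - 1)$.

```python
def rate(g, h, sigma2):
    """Scalar solution R of 2*A*R = -g**2 * sigma2 with A = 1/2 - g*h.

    The matrix A is stable only if g*h > 1/2; otherwise no rate exists.
    """
    if g * h <= 0.5:
        raise ValueError("A = 1/2 - g*h is not stable")
    return g * g * sigma2 / (2.0 * g * h - 1.0)

h, sigma2 = 2.0, 1.0
g_star = 1.0 / h  # scalar analogue of the optimal gain (4.136)
candidates = [0.3, 0.4, g_star, 0.75, 1.0, 2.0]
rates = {g: rate(g, h, sigma2) for g in candidates}
```

In this setting $R(g) - R(g^*) = \sigma^2 (gh - 1)^2 / (h^2 (2gh - 1)) \ge 0$, the scalar counterpart of the matrix inequality (4.135): both smaller and larger gains than $1/h$ give a worse rate.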


Unfortunately, this optimal gain matrix is not directly realizable, since we do not
know a priori the optimal point $x^*$. In the next subsection we will show how
to overcome this drawback, using some feasible realization of the optimal gain
matrix $\Gamma_n^*$.


4.6.7. Feasible Realization of the Optimal Gain Matrix


Since the optimal gain matrix $\Gamma_n^*$ (4.141) is not realizable, it seems natural to try
to use another (but realizable) gain that is in some sense asymptotically (when 
n — oo) close to the optimal one. In what follows, we shall consider two possible 
approaches in order to deal with the problem: direct use of the Hessian and the 
Ruppert-Polyak procedure [88, 101]. The approach based on the Hessian is very 
restrictive but seems to be natural; the second approach is useful in practice for a 
wide class of optimization problems but involves two-level recurrent calculations 
using averaging. 


First Approach
Let us consider the following optimization procedure:

$$x_n = x_{n-1} - \Gamma_n Y_n,$$

$$\Gamma_n = \begin{cases} \dfrac{1}{n}\,I, & n \le n_0 \ (n_0 \text{ large enough}), \\[6pt] \dfrac{1}{n}\left[\nabla^2 f(x_{n-1})\right]^{-1}, & n > n_0. \end{cases} \tag{4.142}$$

Here it has been assumed that the matrix of second derivatives (Hessian) is available
for any current point $x_{n-1}$. This assumption is certainly very restrictive but makes
the procedure (4.142) realizable and asymptotically optimal, since when $n \to \infty$,

$$n\left(\Gamma_n - \Gamma_n^*\right) \xrightarrow{a.s.} 0. \tag{4.143}$$

Indeed, since $x_n \to x^*$, it follows that $\nabla^2 f(x_{n-1}) \to \nabla^2 f(x^*)$, which implies (4.143).


Finally, observe that this optimization procedure (4.142) corresponds to the Newton
method.


Second Approach (Two-level Recurrent Procedure with Averaging)
Following the works of Ruppert [101] and Polyak [86, 87], let us consider the
two-level recurrence procedure:

$$x_n = x_{n-1} - \gamma_n Y_n, \tag{4.144}$$

$$\bar{x}_n = \frac{1}{n}\sum_{t=1}^{n} x_t = \bar{x}_{n-1}\left(1 - \frac{1}{n}\right) + \frac{1}{n}x_n. \tag{4.145}$$

Here the scalar gain (correction factor) $\gamma_n$ satisfies the following conditions:

$$0 < \gamma_n, \qquad \sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \sum_{n=1}^{\infty} \gamma_n^2 < \infty,$$

which evidently are satisfied if we select the correction factor $\gamma_n$ as follows:

$$\gamma_n = \frac{1}{n^{\gamma}}, \qquad \tfrac{1}{2} < \gamma \le 1. \tag{4.146}$$

We have already shown that for $\gamma = 1$ (or $\gamma_n = 1/n$), the best (highest) order
of convergence of the sequence $\{x_n\}$ is achieved. But here we are interested in
the convergence rate of the second-level estimations $\{\bar{x}_n\}$ obtained by averaging
(4.145). According to the Ruppert-Polyak suggestion, let us consider more slowly
decreasing gain steps in the first level of the recurrence, namely, we will consider that $\gamma$ in (4.146)
belongs to the interval

$$\tfrac{1}{2} < \gamma < 1,$$

which does not include the point $\gamma^* = 1$.


In what follows, we will show that if we slow down the speed of the first level
of the recurrence in (4.144), we significantly accelerate the rate of convergence
of the second level $\{\bar{x}_n\}$, making it asymptotically optimal. In other words, the
two-level procedure (4.145) for optimization with a scalar gain $\gamma_n$ turns out to be
asymptotically equivalent to the optimization procedure (4.106) with the optimal
(but not realizable) gain matrix (4.141). From a practical point of view, this is
certainly an important and useful result.
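The two-level procedure (4.144)-(4.145) can be sketched in a few lines for a scalar problem. Everything specific in this example is an assumption made for illustration: the quadratic test function with $x^* = 1$, the exponent $\gamma = 0.7$, the noise level and the function names.

```python
import random

def averaged_sa(grad, x0, gamma_exp, n_iter, sigma, rng):
    """First level:  x_n = x_{n-1} - n**(-gamma_exp) * Y_n.
    Second level: xbar_n = xbar_{n-1} * (1 - 1/n) + x_n / n."""
    x, xbar = x0, 0.0
    for n in range(1, n_iter + 1):
        y = grad(x) + rng.gauss(0.0, sigma)  # noisy gradient observation
        x -= n ** (-gamma_exp) * y           # first level (4.144)
        xbar += (x - xbar) / n               # running average (4.145)
    return x, xbar

# Illustrative problem: f(x) = (x - 1)^2, so grad f(x) = 2 (x - 1) and x* = 1.
x_last, x_avg = averaged_sa(lambda x: 2.0 * (x - 1.0), x0=0.0,
                            gamma_exp=0.7, n_iter=20000, sigma=1.0,
                            rng=random.Random(2))
```

Note that the procedure never uses the curvature of $f$: the scalar gain $n^{-\gamma}$ needs no knowledge of $\nabla^2 f(x^*)$, which is what makes this variant realizable in contrast to the gain (4.141).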


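Theorem 13 (On accelerating by averaging) (see [86, 87, 101]) Under
assumptions H1-H5, for the two-level recurrent optimization procedure

$$x_n = x_{n-1} - \gamma_n Y_n, \qquad \bar{x}_n = \frac{1}{n}\sum_{t=1}^{n} x_t = \bar{x}_{n-1}\left(1 - \frac{1}{n}\right) + \frac{1}{n}x_n \tag{4.147}$$

the following property holds:

$$\sqrt{n}\,\left[\bar{x}_n - x^*\right] - \left[\nabla^2 f(x^*)\right]^{-1}\left(-\frac{1}{\sqrt{n}}\sum_{t=1}^{n} \xi_t\right) \xrightarrow[n\to\infty]{prob} 0, \tag{4.148}$$

which means that the two normalized processes

$$\sqrt{n}\,\left[\bar{x}_n - x^*\right] \quad \text{and} \quad -\left[\nabla^2 f(x^*)\right]^{-1}\frac{1}{\sqrt{n}}\sum_{t=1}^{n} \xi_t$$

have the same asymptotic distribution, which is equal to

$$N\left(0, \left[\nabla^2 f(x^*)\right]^{-1}\Xi\left[\nabla^2 f(x^*)\right]^{-1}\right). \tag{4.149}$$

Property (4.148) will be denoted by

$$\sqrt{n}\,\left[\bar{x}_n - x^*\right] \overset{prob}{\sim} -\left[\nabla^2 f(x^*)\right]^{-1}\frac{1}{\sqrt{n}}\sum_{t=1}^{n} \xi_t.$$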
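Proof We shall briefly present the main lines of the proof. In the previous
subsections we have already proved that

$$x_n \xrightarrow[n\to\infty]{a.s.} x^*$$

and in the mean squares sense, that is, for large enough n, the Taylor expansion of
$\nabla f(x_{n-1})$ leads to

$$\nabla f(x_{n-1}) = \nabla^2 f(x^*)(x_{n-1} - x^*) + o(\|x_{n-1} - x^*\|).$$

In view of Lemma A.3-2 in [91] for $\nu_n = n^{1-2\varepsilon}$, we get

$$o(\|x_{n-1} - x^*\|) \overset{a.s.}{=} o_\omega\left(\frac{1}{n^{1/2-\varepsilon}}\right)$$

and the procedure (4.144)-(4.145) becomes as follows:

$$\Delta_n = \left(I - \frac{\nabla^2 f(x^*)}{n^{\gamma}}\right)\Delta_{n-1} - \frac{1}{n^{\gamma}}\xi_n + o_\omega\left(\frac{1}{n^{\gamma+1/2-\varepsilon}}\right), \qquad \Delta_n := x_n - x^*, \tag{4.150}$$

$$\bar{\Delta}_n = \left(1 - \frac{1}{n}\right)\bar{\Delta}_{n-1} + \frac{1}{n}\Delta_n, \qquad \bar{\Delta}_n := \bar{x}_n - x^* = \frac{1}{n}\sum_{t=1}^{n} \Delta_t. \tag{4.151}$$

The first recurrence (4.150) may be rewritten as

$$\Delta_{n-1} = -n^{\gamma}\left[\nabla^2 f(x^*)\right]^{-1}(\Delta_n - \Delta_{n-1}) - \left[\nabla^2 f(x^*)\right]^{-1}\xi_n + o_\omega\left(\frac{1}{n^{1/2-\varepsilon}}\right)$$

and hence $\bar{\Delta}_n$ may be represented (decomposed) as follows:

$$\bar{\Delta}_n = \bar{\Delta}_n^{(1)} + \bar{\Delta}_n^{(2)} + \bar{\Delta}_n^{(3)}, \tag{4.152}$$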
where
à? 2 [vr] lye, 


n 
t=] 


Ap = [PI] z X ri An 
f=1 


1 a.s 


~ (3) appa Tile E 1 s, 
A -[vren] Ee mn) mn) ZO 
t=1 


Here we used the property 
Isl 1] 1 
:Yua-io(a)-9(3) 0851. 
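To conclude the proof of this theorem, it is sufficient to show that, in probability,

$$\sqrt{n}\left(\tilde{\Delta}_n^{(2)} + \tilde{\Delta}_n^{(3)}\right) \;\xrightarrow[n\to\infty]{\text{prob}}\; 0. \tag{4.153}$$

But to prove this, it is sufficient to show the almost sure convergence of (4.153).

By the Chebyshev inequality (i = 2, 3)

$$\Pr\left\{\left\|\sqrt{n}\,\tilde{\Delta}_n^{(i)}\right\| > \varepsilon > 0\right\} \le \varepsilon^{-2}\, E\left\{\left\|\sqrt{n}\,\tilde{\Delta}_n^{(i)}\right\|^2\right\},$$

the mean squares convergence implies convergence in probability.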


~(2 C R 
The term AC í may be rewritten in the following form: 


zQ l- 2 -1 ] 
At --À-|v f (x*)| "EIS 


* nM z 
+ [VF (x )| go a) Aes (4.154) 


From (4.121) and the mean squares convergence property, we have 


l 
E[lA4I] = O (2) 
and, hence,
E T eso =) Sl t). 
nzc Waal = O | ayn ) = O | eA) oe 


e|; [al] =0 (7) ze 
wear a-(-3 =t+0(!)=0(2) 


(here we have used the fact that o(1/t) tends more quickly to zero than O (1/t)), it 
follows that 


Te DI -e-DA "j= ax" Jo(zz) 
AE (oh) deo (ch) on) x 


Using these estimations, from (4. 153) we conclude that (4.154) holds, and hence 


Since 


2 prob ~(I 
A," AL), 


One can see from property (4.149) that the two-level recurrent procedure with scalar 
gain (4.147) is asymptotically equivalent to the recurrent procedure (4.106) with 
the optimal matrix gain steps (4.141). So, it is a realizable version of the optimal 
optimization procedure (4.106). 


Based on the application of the Cramér-Rao inequality [58], it can be shown (see for
instance [89]) that for Gaussian noises, this procedure is optimal not only within 
the classes (4.106), but also for all possible optimization procedures using the 
complete measurements available at the current time. In other words, there exists 
no realizable algorithm which under Gaussian noise processes has a better rate of 
optimization than the two-level procedure (4.147) with averaging. Observe that the 
convergence analysis presented in this section can be applied to other stochastic 
approximation algorithms as well. 
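
The averaging idea of Theorem 13 can be illustrated numerically. The following is only a minimal sketch under stated assumptions, not the procedure (4.144)-(4.145) itself: the objective is a scalar quadratic with minimizer `x_star`, the noise is Gaussian, the scalar gain is taken as $n^{-0.7}$, and the name `averaged_sgd` and all parameter values are hypothetical choices made for this illustration.

```python
import random


def averaged_sgd(grad, x0, steps, noise_std=1.0, gamma_exp=0.7, seed=0):
    """Noisy gradient steps (first level) plus a running average (second level).

    All names and parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = x0        # first level: stochastic approximation iterate
    x_bar = 0.0   # second level: arithmetic mean of the iterates
    for n in range(1, steps + 1):
        gamma = n ** (-gamma_exp)                  # slowly decreasing scalar gain
        noisy_grad = grad(x) + rng.gauss(0.0, noise_std)
        x = x - gamma * noisy_grad
        x_bar = x_bar * (1.0 - 1.0 / n) + x / n    # recursive form of the mean
    return x, x_bar


x_star = 2.0
last, avg = averaged_sgd(lambda x: x - x_star, x0=5.0, steps=20000)
print("error of last iterate:", abs(last - x_star))
print("error of averaged iterate:", abs(avg - x_star))
```

In such an experiment the averaged iterate is typically noticeably closer to the minimizer than the last iterate, although a single run with a fixed seed only illustrates, and does not prove, the asymptotic statement.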


4.7. Summary 
In this long chapter, our first order of business was to build up most of the tools,
constructive procedures and technicalities needed for the analysis of recursive
stochastic algorithms (convergence and convergence rate). We have striven to build
a methodology for these purposes and have illustrated it with a set
of examples. Some of them are presented in detail; in others we have
tried to show how proofs can rely on existing results (well-known inequalities,
lemmas and theorems) or on other types of reasoning. A method for estimating
the convergence rate of a stochastic approximation technique was presented in
detail, and we indicated how it can be applied to estimating the convergence
rate of more complex iterative algorithms. In other words, we have pieced together
the main results needed to derive the asymptotic properties of stochastic
recursive algorithms. Notice that the inequalities, lemmas and theorems presented
in Appendix A can also serve as examples of how such statements are proved.


Despite its length, this chapter should be regarded as only an initialization (a first
iteration step) of the long training process associated with the analysis of recursive
stochastic algorithms. We draw the reader's attention to the fact that experience is
necessary: one has to work through many proofs, and make the effort to reproduce
them, in order to be able to analyze recursive stochastic algorithms.


References 


[1] D. Ait-Kadi and R. Cleroux, Optimal block replacement policies
with multiple choices at failure, Naval Research Logistics, vol. 35,
pp. 99-110, 1988.

[2] D. Ait-Kadi and R. Cleroux, Replacement strategies with mixed corrective
actions at failure, Computers and Operations Research, vol. 18,
pp. 141-149, 1991.

[3] H. Akashi and K. A. Moustafa, A stable identification structure for a class
of stochastic systems, IEEE Transactions on Automatic Control, vol. 26,
no. 3, pp. 717-721, 1981.

[4] G. Alexitz, Convergence Problems of Orthogonal Series, Pergamon Press,
Oxford, 1961.

[5] H. Arsham, Systems Simulation: The Shortest Path from Learning to
Applications, http://home.ubalt.edu/ntsbarsh/index.html/

[6] R. B. Ash, Real Analysis and Probability, Academic Press, New York,
1972.

[7] K. J. Åström and B. Wittenmark, Computer Controlled Systems: Theory
and Design, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[8] N. Baba, New Topics in Learning Automata Theory and Applications,
Springer-Verlag, Berlin, 1984.


[9] R. E. Barlow, Engineering Reliability, SIAM Classics in Applied
Mathematics, Philadelphia, 1998.

[10] N. Bartoli and P. Del Moral, Simulation Algorithmes Stochastiques,
Cépadues-Editions, Toulouse, 2001.

[11] R. Bellman, Introduction to Matrix Analysis, McGraw-Hill, New York,
1970.

[12] G. Bennett, Probability inequalities for sums of independent random
variables, Journal of the American Statistical Association, vol. 57,
pp. 33-45, 1962.

[13] A. Benveniste, M. Metivier and P. Priouret, Stochastic Approximations and
Adaptive Algorithms, Springer-Verlag, Berlin, 1990.

[14] E. Billard and J. Pasquale, Dynamic scope of control in decentralized job
scheduling, IEEE International Symposium on Autonomous Decentralized
Systems, pp. 183-189, 1993.

[15] R. Bitmead, M. Gevers and V. Wertz, Adaptive Optimal Control - The
Thinking Man's GPC, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[16] J. A. Blum, Multidimensional stochastic approximation methods, Annals
of Mathematical Statistics, vol. 25, pp. 737-744, 1954.

[17] R. R. Bush and F. Mosteller, Stochastic Models for Learning, John Wiley
and Sons, New York, 1958.

[18] M. Chidambaram, Control of non-minimum phase nonlinear system by
learning automata, Hungarian Journal of Industrial Chemistry, vol. 22,
pp. 261-265, 1994.

[19] E. K. P. Chong and P. J. Ramadge, Convergence of recursive optimization
algorithms using infinitesimal perturbation analysis estimates,
Discrete Event Dynamic Systems: Theory and Applications, vol. 1, no. 4,
pp. 339-372, 1992.

[20] E. K. P. Chong, I.-J. Wang and S. R. Kulkarni, Noise condition for
prespecified convergence rates of stochastic approximation algorithms, IEEE
Transactions on Information Theory, vol. 45, no. 2, pp. 810-814, 1999.

[21] Y. S. Chow and H. Teicher, Probability Theory: Independence,
Interchangeability, Martingales, Second Edition, Springer-Verlag, Berlin,
1988.

[22] J. G. Cross, A stochastic learning model of economic behavior, The
Quarterly Journal of Economics, vol. 87, pp. 239-266, 1973.


[23] L. P. Devroye, An expanding automaton for use in stochastic
optimization, Journal of Cybernetics and Information Science, vol. 1, no. 2,
pp. 82-94, 1977.

[24] J. L. Doob, Stochastic Processes, John Wiley and Sons, New York,
1953.

[25] E. J. Dudewicz and S. N. Mishra, Modern Mathematical Statistics,
John Wiley and Sons, New York, 1988.

[26] M. Duflo, Random Iterative Models, Springer-Verlag, Berlin, 1997.

[27] P. Dupuis and H. J. Kushner, Asymptotic behavior of constrained stochastic
approximations via the theory of large deviations, Probability Theory and
Related Fields, vol. 75, pp. 223-244, 1987.

[28] A. Dvoretzky, On stochastic approximation, Proceedings of the Third
Berkeley Symposium on Mathematical Statistics and Probability, vol. 1,
pp. 39-55, 1959.

[29] T. Elliott, A spin-glass-like Lyapunov function for a neurotrophic model
of neuronal development, Biological Cybernetics, vol. 86, pp. 473-481,
2002.

[30] Yu. Ermoliev and R. J.-B. Wets (eds), Numerical Techniques for Stochastic
Optimization, Springer-Verlag, Berlin, 1980.

[31] Y. M. El-Fattah and C. Foulard, Learning Systems: Decision, Simulation
and Control, Springer-Verlag, Berlin, 1978.

[32] W. Feller, An Introduction to Probability and its Applications, vol. II, John
Wiley and Sons, New York, 1966.

[33] G. P. Frost, T. J. Gordon and Q. H. Wu, The application of learning automata
to advanced vehicle suspension control, IUTAM Symposium, The Active
Control of Vibration, September 5-8, Bath, UK, 1994.

[34] D. K. Fuk and S. V. Nagaev, Probability inequalities for sums of independent
random variables, Theory of Probability and its Applications, vol. 16,
no. 4, pp. 643-660, 1971.

[35] A. Gladyshev, On stochastic approximation, Theory of Probability and
Applications, vol. 10, no. 2, pp. 275-278, 1965.

[36] G. Gong, Y. Liu and M. Qian, An adaptive simulated annealing algorithm,
Stochastic Processes and Their Applications, vol. 94, pp. 95-103, 2001.


[37] G. C. Goodwin and R. L. Payne, Dynamic System Identification:
Experiment Design and Data Analysis, Academic Press, New York,
1977.

[38] G. H. Hardy, Divergent Series, Oxford University Press, London, 1967.

[39] J. B. Hiriart-Urruty, Algorithms for penalization type and dual type for the
solution of stochastic optimization problems with stochastic constraints, in
J. R. Barra et al. (eds), Recent Developments in Statistics, North-Holland,
Amsterdam, pp. 183-219, 1977.

[40] M. N. Howell, G. P. Frost, T. J. Gordon and Q. H. Wu, Continuous
action reinforcement learning applied to vehicle suspension control,
Mechatronics, vol. 7, no. 3, pp. 263-276, 1997.

[41] M. N. Howell, T. J. Gordon and F. V. Brandao, Genetic learning automata
for function optimization, IEEE Transactions on Systems, Man and
Cybernetics, vol. 32, no. 6, pp. 804-815, 2002.

[42] E. Ikonen, K. Najim and D. Ait-Kadi, Learning automata in maintenance
optimization, International Conference on Industrial Engineering and
Production Management, Oporto, Portugal, May 26-28, 2003.

[43] E. Ikonen and K. Najim, Use of learning automata in distributed fuzzy logic
processor training, IEE Proceedings - Control Theory and Applications,
vol. 144, no. 3, pp. 255-262, 1997.

[44] E. Ikonen and K. Najim, Advanced Process Identification and Control,
Marcel Dekker, New York, 2002.

[45] O. Johnson, Entropy inequalities and the central limit theorem, Stochastic
Processes and their Applications, vol. 88, pp. 291-304, 2000.

[46] H. Kesten, Accelerated stochastic approximation, Annals of Mathematical
Statistics, vol. 29, pp. 41-59, 1958.

[47] J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of
a regression, Annals of Mathematical Statistics, vol. 23, pp. 462-466,
1952.

[48] T. Konstantopoulos and G. Last, On the use of Lyapunov function methods
in renewal theory, Stochastic Processes and their Applications, vol. 79,
pp. 165-178, 1999.

[49] V. M. Kruglov and Z. Bo, Weak convergence of random sums, SIAM
Journal on Theory of Probability and its Applications, vol. 46, no. 1,
pp. 39-57, 2001.


[50] W. Kuo and V. R. Prasad, An annotated overview of system-reliability
optimization, IEEE Transactions on Reliability, vol. 49, pp. 176-187,
2000.

[51] H. J. Kushner and E. Sanvicente, Stochastic approximation for constrained
systems with observation noise on the system and constraints, Automatica,
vol. 11, pp. 375-380, 1975.

[52] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for
Constrained and Unconstrained Systems, Springer-Verlag, Berlin, 1978.

[53] H. O. Lancaster, The Chi-squared Distribution, John Wiley and Sons,
New York, 1969.

[54] B. H. Li, Q. H. Wu, P. Y. Wang and X. X. Zhou, Dynamic quadrature
booster control using reinforcement learning, UKACC International
Conference on Control '98, pp. 993-998, 1998.

[55] W. Liu and W. Yang, The Markov approximation of the sequences of
N-valued random variables and a class of small deviation theorems,
Stochastic Processes and their Applications, vol. 89, pp. 117-130, 2000.

[56] W. Liu, J. Yan and W. Yang, A limit theorem for partial sums of random
variables and its applications, Statistics and Probability Letters, vol. 62,
pp. 79-86, 2003.

[57] L. Ljung, Analysis of recursive stochastic algorithms, IEEE Transactions
on Automatic Control, vol. 20, pp. 551-575, 1977.

[58] L. Ljung and T. Söderström, Theory and Practice of Recursive
Identification, MIT Press, Cambridge, MA, 1983.

[59] L. Ljung, G. Pflug and H. Walk, Stochastic Approximation and
Optimization of Random Systems, Birkhauser Verlag, Berlin, 1992.

[60] M. Loève, Probability Theory, D. Van Nostrand Co., Princeton, NJ, 1963.

[61] G. J. McMurtry and K. S. Fu, A variable structure automaton used as a
multimodal searching technique, IEEE Transactions on Automatic Control,
vol. 11, pp. 379-387, 1966.

[62] K. Najim, Control of Liquid-Liquid Extraction Columns, Gordon and
Breach, London, 1988.

[63] K. Najim (ed), Process Modeling and Control in Chemical Engineering,
Marcel Dekker, New York, 1989.


[64] K. Najim and H. Youlal, Regularized pole placement adaptive control
of a liquid-liquid extraction column, International Journal of Systems
Science, vol. 21, no. 7, pp. 1313-1323, 1990.

[65] K. Najim, Modelling and control of an absorption column using an
automaton, International Journal of Adaptive Control and Signal Processing,
vol. 5, pp. 335-345, 1991.

[66] K. Najim and G. Oppenheim, Learning systems: theory and applications,
IEE Proceedings Computer and Digital Techniques, vol. 138, pp. 183-192,
1991.

[67] K. Najim and E. Dufour, Advanced Control of Chemical Processes,
Pergamon Press, Oxford, 1992.

[68] K. Najim and A. S. Poznyak, Learning Automata: Theory and Applications,
Pergamon Press, Oxford, 1994.

[69] K. Najim, A. Rusnak, M. Fikar and A. Mészaros, Constrained long-range
predictive control based on neural networks, International Journal of
Systems Science, vol. 28, no. 12, pp. 1211-1226, 1997.

[70] K. Najim, E. Ikonen and U. Kortela, On the use of adaptive learning systems
with changing number of actions for optimization and control, IFAC World
Congress, Beijing, 1999.

[71] K. Najim and E. Ikonen, Distributed logic processors trained under
constraints using stochastic approximation techniques, IEEE Transactions on
Systems, Man, and Cybernetics - Part A, vol. 29, no. 4, pp. 421-426, 1999.

[72] K. Najim and E. Ikonen, Outils mathématiques pour le génie des procédés,
Dunod, Paris, 1999.

[73] K. Najim, A. S. Poznyak and E. Gomez, Adaptive policy for two finite
Markov chains zero-sum stochastic game with unknown transition matrices
and average payoffs, Automatica, vol. 37, pp. 1007-1018, 2001.

[74] K. Najim and E. Ikonen, Analysis of Iterative Stochastic Algorithms,
University of Oulu, Oulu, Finland, 2002.

[75] K. Najim and E. Ikonen, Analysis of an Optimization Algorithm
Based on the Bush-Mosteller Scheme, http://cc.oulu.fi/iko/ABMS.html,
2003.

[76] K. Najim, P. Del Moral and E. Ikonen, An improved version of the
McMurtry-Fu reinforcement learning scheme, International Journal of
Systems Sciences, vol. 34, pp. 37-47, 2004.


[77] K. Najim, A. S. Poznyak and E. Ikonen, Optimization based on a team of
automata with binary outputs, Automatica, vol. 40, (in press), 2004.

[78] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An
Introduction, Prentice-Hall, Englewood Cliffs, 1989.

[79] A. V. Nazin and L. Ljung, Asymptotically optimal smoothing of stochastic
approximation estimates for regressor parameter tracking, Report no.
LiTH-ISY-R-2360, University of Linköping, Automatic Control Laboratory,
October 2, 2001.

[80] Yu. Nesterov and A. Nemirovsky, Interior point polynomial methods in
convex programming, Theory and Applications SIAM, 1993.

[81] M. B. Nevelson and Khas'minskii, Stochastic Approximation and Recursive
Estimation, American Mathematical Society Translation of Mathematical
Monographs, vol. 47, 1976.

[82] J. Neveu, Discrete-Parameter Martingales, North-Holland, Amsterdam,
1975.

[83] B. J. Oommen and S. S. Iyengar, Trajectory planning of robot manipulators
in noisy work spaces using stochastic automata, The International Journal
of Robotics Research, vol. 10, no. 2, pp. 135-148, 1991.

[84] M. Pelletier, On the almost sure asymptotic behavior of stochastic
algorithms, Stochastic Processes and their Applications, vol. 78,
pp. 217-244, 1998.

[85] G. Pflug, Stepsize rules, stopping times and their implementation
in stochastic quasigradient algorithms, in Yu. Ermoliev and R. Wets
(eds), Numerical Techniques for Stochastic Optimization, Springer-Verlag,
Berlin, pp. 137-160, 1988.

[86] B. T. Polyak, Convergence and convergence rate of iterative stochastic
algorithms. I. General case, Automation and Remote Control - Part 1,
vol. 37, no. 12, pp. 1858-1868, 1976.

[87] B. T. Polyak, Convergence and convergence rate of iterative stochastic
algorithms. II. The linear case, Automation and Remote Control - Part 2,
vol. 38, no. 4, pp. 537-542, 1977.

[88] B. T. Polyak, New method of stochastic approximation type, Automation
and Remote Control, vol. 51, pp. 937-946, 1990.

[89] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation
by averaging, SIAM Journal on Control and Optimization, vol. 30, no. 4,
pp. 838-855, 1992.


[90] A. S. Poznyak and I. P. Devyaterikov, Optimization of Uncertain
Mechanical Systems (in Russian), Lecture notes, MFTI, Moscow, 1984.

[91] A. S. Poznyak and K. Najim, Learning Automata and Stochastic
Optimization, Springer-Verlag, Berlin, 1997.

[92] A. Poznyak, Strong law of large numbers for dependent vector processes
with decreasing correlation: double averaging concept, Mathematical
Problems in Engineering, vol. 7, pp. 87-95, 2001.

[93] A. S. Poznyak, K. Najim and E. Gomez, Self-Learning Control of Finite
Markov Chains, Marcel Dekker, New York, 2000.

[94] A. S. Poznyak and K. Najim, Bush-Mosteller learning for a zero-sum
repeated game with random payoffs, International Journal of Systems
Science, vol. 32, pp. 1251-1260, 2001.

[95] A. S. Poznyak and K. Najim, Learning through reinforcement for N-person
repeated constrained games, IEEE Transactions on Systems, Man and
Cybernetics - Part B: Cybernetics, vol. 32, no. 6, pp. 759-771, 2002.

[96] H. Z. Qin and S. H. Feng, Deconvolution kernel estimator for mean
transformation with ordinary smooth error, Statistics and Probability Letters,
vol. 61, pp. 337-346, 2003.

[97] C. Radhakrishna Rao, Linear Statistical Inference and its Applications,
Second Edition, John Wiley and Sons, New York, 1973.

[98] M. Radenkovic, T. Bose and T. Mathurasai, Optimality and almost sure
convergence of adaptive IIR filters with output error recursion, Digital
Signal Processing, vol. 9, pp. 315-328, 1999.

[99] H. Robbins and S. Monro, A stochastic approximation method, Annals of
Mathematical Statistics, vol. 22, pp. 400-407, 1951.

[100] H. Robbins and D. Siegmund, A convergence theorem for nonnegative
almost supermartingales and some applications, in J. S. Rustagi (ed.),
Optimizing Methods in Statistics, Academic Press, New York, 1971.

[101] D. Ruppert, Efficient estimators from a slowly convergent
Robbins-Monro process, Technical Report no. 781, School of Operations
Research and Industrial Engineering, Cornell University, Ithaca, NY,
1988.

[102] J. Sacks, Asymptotic distributions of stochastic approximation procedures,
Annals of Mathematical Statistics, vol. 29, pp. 373-405, 1958.

[103] A. N. Shiryayev, Probability, Springer-Verlag, Berlin, 1984.


[104] W. F. Stout, Almost Sure Convergence, Academic Press, New York,
1974.

[105] C. K. K. Tang and P. Mars, Games of stochastic automata and adaptive
signal processing, IEEE Transactions on Systems, Man and Cybernetics,
vol. 23, pp. 851-856, 1993.

[106] J. Thibault and K. Najim, Static optimisation of a fermenter using a learning
automaton, Journal of Systems Engineering, vol. 6, pp. 89-97, 1966.

[107] T.-H. Tsai, Empirical law of the iterated logarithm for Markov chains with
countable state space, Stochastic Processes and their Applications, vol. 89,
pp. 175-191, 2000.

[108] Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems, Academic
Press, New York, 1971.

[109] Y. Tsypkin, Foundations of the Theory of Learning Systems, Academic
Press, New York, 1973.

[110] C. Unsal, P. Kachroo and J. S. Bay, Multiple stochastic learning automata
for vehicle path control in an automated highway system, IEEE Transactions
on Systems, Man and Cybernetics - Part A: Systems and Humans,
vol. 29, no. 1, pp. 120-128, 1999.

[111] L. Vandenberghe and S. Boyd, A polynomial-time algorithm for
determining quadratic Lyapunov functions for nonlinear systems, Proceedings of
the European Conference on Circuit Theory and Design, pp. 1065-1068,
1993.

[112] H. Walk, Stochastic iteration for a constrained optimization problem,
Communications in Statistics-Sequential Analysis, vol. 2, pp. 369-385,
1983-1984.

[113] Q. H. Wu and A. C. Pugh, Reinforcement learning control of unknown
dynamic systems, IEE Proceedings-D, vol. 140, no. 5, pp. 313-322, 1993.


Appendix A 


Inequalities, Lemmas and Theorems 


In this appendix we quote a number of results dealing with inequalities, lemmas and 
theorems. A knowledge of these results will be useful in the analysis of recursive 
stochastic algorithms. 


We start by presenting a set of inequalities. 


A.1. Inequalities 


Triangle inequality A basic and commonly used inequality is certainly the triangle
inequality for real numbers, i.e.,

$$|x + y| \le |x| + |y|. \tag{A.1}$$

To prove (A.1), it is enough to consider the four different cases ($x, y \ge 0$; $x \ge 0$,
$y < 0$; $x < 0$, $y \ge 0$; and $x, y < 0$).

By taking the expectation on both sides of the triangle inequality (A.1), for real
random variables x and y, we derive

$$E\{|x + y|\} \le E\{|x|\} + E\{|y|\}. \tag{A.2}$$

In the same way, since $(|x| + |y|)^2 \le 2x^2 + 2y^2$, it is easy to verify that

$$E\{|x + y|^2\} \le 2E\{x^2\} + 2E\{y^2\}. \tag{A.3}$$


Now we shall present the Hölder inequality.

Hölder inequality Let p and q be two real numbers greater than 1 such that

$$\frac{1}{p} + \frac{1}{q} = 1; \tag{A.4}$$

then

$$\sum_{i=1}^{n} x_i y_i \le \left(\sum_{i=1}^{n} x_i^p\right)^{1/p} \left(\sum_{i=1}^{n} y_i^q\right)^{1/q},$$

where $x_i$ and $y_i$ ($i = 1, \dots, n$) are positive numbers.
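Proof Let a and b be two strictly positive real numbers, and define $\alpha$ and $\beta$ as
follows:

$$a = \exp\left(\frac{\alpha}{p}\right) \quad\text{and}\quad b = \exp\left(\frac{\beta}{q}\right). \tag{A.5}$$

From the convexity property of the exponential function and condition (A.4) we
derive:

$$\exp\left(\frac{\alpha}{p} + \frac{\beta}{q}\right) \le \frac{1}{p}\exp(\alpha) + \frac{1}{q}\exp(\beta),$$

and in view of (A.5), we get

$$ab \le \frac{1}{p}a^p + \frac{1}{q}b^q. \tag{A.6}$$

Consider now 2n positive numbers $x_1, \dots, x_n$ and $y_1, \dots, y_n$, and let

$$a = x_i\left(\sum_{j=1}^{n} x_j^p\right)^{-1/p} \quad\text{and}\quad b = y_i\left(\sum_{j=1}^{n} y_j^q\right)^{-1/q};$$

then from (A.6) and for each i ($i = 1, \dots, n$), we obtain

$$\frac{x_i y_i}{\left(\sum_{j=1}^{n} x_j^p\right)^{1/p}\left(\sum_{j=1}^{n} y_j^q\right)^{1/q}} \le \frac{1}{p}\,\frac{x_i^p}{\sum_{j=1}^{n} x_j^p} + \frac{1}{q}\,\frac{y_i^q}{\sum_{j=1}^{n} y_j^q}$$

and, by summing from i = 1 to n, the right-hand side becomes $1/p + 1/q = 1$, which
gives the desired result.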
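
The Hölder inequality is easy to check numerically. The sketch below is only an illustration for hand-picked positive numbers; the helper name `holder_sides` is a hypothetical choice, and q is computed from p through condition (A.4).

```python
def holder_sides(x, y, p):
    """Return both sides of the Hoelder inequality for positive sequences."""
    q = p / (p - 1.0)                      # conjugate exponent: 1/p + 1/q = 1
    lhs = sum(a * b for a, b in zip(x, y))
    rhs = sum(a ** p for a in x) ** (1.0 / p) * sum(b ** q for b in y) ** (1.0 / q)
    return lhs, rhs


lhs, rhs = holder_sides([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], p=3.0)
print(lhs, "<=", rhs)

# For p = q = 2 and proportional sequences the two sides coincide
# (up to rounding), which is the equality case of the Schwarz inequality.
lhs_eq, rhs_eq = holder_sides([1.0, 2.0], [2.0, 4.0], p=2.0)
print(lhs_eq, rhs_eq)
```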


Schwarz inequality Let us consider the Hölder inequality for the special case
where p = q = 2, when we get the Schwarz inequality

$$|\langle x_1, x_2\rangle| \le \|x_1\| \cdot \|x_2\|,$$

where $\|x\| = \sqrt{\langle x, x\rangle}$ and $\langle\cdot,\cdot\rangle$ represents the inner product.


Minkowski inequality From the Hölder inequality, it is easy to deduce the
Minkowski inequality, which plays an important role in vector spaces and
semi-norms:

$$\left[\sum_i (x_i + y_i)^k\right]^{1/k} \le \left(\sum_i x_i^k\right)^{1/k} + \left(\sum_i y_i^k\right)^{1/k}, \quad k \ge 1. \tag{A.7}$$

Here we shall present another proof of the Minkowski inequality (A.7).

Proof Before doing this, let us prove the following result: if $a, b \ge 0$, $p \ge 1$, then

$$(a + b)^p \le 2^{p-1}\left(a^p + b^p\right). \tag{A.8}$$

It is easy to prove this result. In fact, it is sufficient to consider the derivative of the
difference between the left and right sides of (A.8) after replacing b by x:

$$\frac{d}{dx}\left[(a + x)^p - 2^{p-1}\left(a^p + x^p\right)\right] = p\,(a + x)^{p-1} - 2^{p-1} p\, x^{p-1},$$

which leads to

$$\frac{d}{dx}\left[(a + x)^p - 2^{p-1}\left(a^p + x^p\right)\right]
\begin{cases}
> 0 & \text{for } a + x > 2x \Rightarrow x < a, \\
= 0 & \text{for } x = a, \\
< 0 & \text{for } x > a.
\end{cases}$$

Hence the difference attains its maximum at x = a, and we obtain

$$(a + b)^p - 2^{p-1}\left(a^p + b^p\right) \le (a + a)^p - 2^{p-1}\left(a^p + a^p\right) = 0,$$

which corresponds to the desired result. Finally, the proof of the Minkowski
inequality is based on this result and on the Hölder inequality (see for instance
Chapter 2, p. 83 in R. B. Ash, Real Analysis and Probability, Academic Press,
New York, 1972).


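Cauchy-Bunyakovsky inequality For any two quadratically integrable random
vectors $\xi, \eta \in R^n$, it follows that

$$E\{|\xi^T \eta|\} \le \sqrt{E\{\|\xi\|^2\}}\,\sqrt{E\{\|\eta\|^2\}}.$$

Proof By the concavity property of the logarithmic function ln(x), it follows that
for any $x, y > 0$ and $a, b \ge 0$ with $a + b = 1$,

$$\ln(ax + by) \ge a\ln(x) + b\ln(y),$$

or, in the equivalent form,

$$ax + by \ge x^a y^b.$$

Selecting then

$$a = b = \frac{1}{2}, \quad x = \frac{\|\xi\|^2}{E\{\|\xi\|^2\}}, \quad y = \frac{\|\eta\|^2}{E\{\|\eta\|^2\}},$$

we obtain

$$\frac{1}{2}\left(\frac{\|\xi\|^2}{E\{\|\xi\|^2\}} + \frac{\|\eta\|^2}{E\{\|\eta\|^2\}}\right) \ge \frac{\|\xi\|\,\|\eta\|}{\sqrt{E\{\|\xi\|^2\}}\,\sqrt{E\{\|\eta\|^2\}}} \ge \frac{|\xi^T \eta|}{\sqrt{E\{\|\xi\|^2\}}\,\sqrt{E\{\|\eta\|^2\}}}.$$

Applying then the operator $E\{\cdot\}$ to both sides of the last inequality, it follows that

$$1 \ge \frac{E\{|\xi^T \eta|\}}{\sqrt{E\{\|\xi\|^2\}}\,\sqrt{E\{\|\eta\|^2\}}}.$$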

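Markov inequality For any positive value a, we have that

$$\Pr(|x| \ge a) \le \frac{1}{\varphi(a)}\, E\{\varphi(|x|)\},$$

where $\varphi(|x|)$ is any monotonic increasing function (for example, linear, quadratic
etc.) such that $\varphi(|x|) > 0$ if $|x| > 0$.

Proof The proof is very simple:

$$\begin{aligned}
E\{\varphi(|x|)\} &= \int_{\Omega} \varphi(|x(\omega)|)\,dF \\
&= \int_{\omega:\,\varphi(|x(\omega)|) \ge \varphi(a)} \varphi(|x(\omega)|)\,dF + \int_{\omega:\,\varphi(|x(\omega)|) < \varphi(a)} \varphi(|x(\omega)|)\,dF \\
&\ge \int_{\omega:\,\varphi(|x(\omega)|) \ge \varphi(a)} \varphi(|x(\omega)|)\,dF \\
&\ge \int_{\omega:\,\varphi(|x(\omega)|) \ge \varphi(a)} \varphi(a)\,dF = \varphi(a) \int_{\omega:\,\varphi(|x(\omega)|) \ge \varphi(a)} dF \\
&= \varphi(a)\Pr\left(\varphi(|x|) \ge \varphi(a)\right) = \varphi(a)\Pr(|x| \ge a) \quad \text{(due to monotonicity)},
\end{aligned}$$

which is equivalent to the claim.
For $\varphi(z) = z$, $z \in R^+$, we obtain the Markov inequality.
If $\varphi(z) = z^2$, we obtain the Chebyshev inequality:

$$\Pr(|x| \ge a) \le \frac{1}{a^2} E\{x^2\} = \frac{\sigma^2}{a^2},$$

where $\sigma^2$ represents the variance of the random variable x (taken here with zero
mean, so that $E\{x^2\} = \sigma^2$). This inequality shows
that the variance of a given random variable gives information about the distribution
of its deviation from its mean value.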
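
For a small discrete distribution both sides of these inequalities can be computed exactly. The sketch below assumes, purely for illustration, a random variable uniformly distributed on the five values -2, -1, 0, 1, 2; the helper name `tail_and_bound` is hypothetical.

```python
from fractions import Fraction


def tail_and_bound(values, a, phi):
    """Exact Pr(|x| >= a) and the bound E{phi(|x|)} / phi(a) for a uniform x."""
    n = len(values)
    tail = Fraction(sum(1 for v in values if abs(v) >= a), n)
    mean_phi = sum(Fraction(phi(abs(v))) for v in values) / n
    return tail, mean_phi / Fraction(phi(a))


values = [-2, -1, 0, 1, 2]
markov = tail_and_bound(values, a=2, phi=lambda z: z)         # phi(z) = z
chebyshev = tail_and_bound(values, a=2, phi=lambda z: z * z)  # phi(z) = z^2
print("Markov:    tail", markov[0], "bound", markov[1])
print("Chebyshev: tail", chebyshev[0], "bound", chebyshev[1])
```

In this example the quadratic choice of $\varphi$ gives the tighter of the two bounds, but neither bound is attained.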


Kolmogorov inequality The Kolmogorov inequality gives a bound for sums of
independent random variables $X_i$ with zero means:

$$\Pr\left(\max_{1 \le k \le n} |S_k| \ge \varepsilon\right) \le \frac{1}{\varepsilon^2} \sum_{i=1}^{n} E\{(X_i)^2\},$$

where $S_k$ is the sum of $X_i$, $i = 1, \dots, k$. Compare it with the Chebyshev
bound for $\Pr(|S_n| \ge \varepsilon)$, which has the same right-hand side but controls only
the final sum.

The Kolmogorov inequality is a special case of the Hájek-Rényi inequality.
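
The comparison with the Chebyshev bound can be made concrete by exact enumeration. The sketch below assumes, as an illustration only, independent steps taking the values +1 and -1 with probability 1/2 each, so that $E\{(X_i)^2\} = 1$; the function name `walk_probabilities` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def walk_probabilities(n, eps):
    """Exact Pr(max_k |S_k| >= eps), Pr(|S_n| >= eps) and the common bound."""
    paths = list(product((-1, 1), repeat=n))   # all 2^n equally likely paths
    hit_max = 0
    hit_end = 0
    for steps in paths:
        s = 0
        running_max = 0
        for step in steps:
            s += step
            running_max = max(running_max, abs(s))
        hit_max += running_max >= eps          # some partial sum reached eps
        hit_end += abs(s) >= eps               # only the final sum is checked
    bound = Fraction(n, eps ** 2)              # (1/eps^2) * sum of E{X_i^2}
    return Fraction(hit_max, len(paths)), Fraction(hit_end, len(paths)), bound


p_max, p_end, bound = walk_probabilities(n=4, eps=3)
print(p_end, "<=", p_max, "<=", bound)
```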


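Hájek-Rényi inequality For any positive integers m, n with m < n and arbitrary
$\varepsilon > 0$,

$$\Pr\left(\max_{m \le i \le n} c_i |S_i| \ge \varepsilon\right) \le \frac{1}{\varepsilon^2}\left(c_m^2 \sum_{i=1}^{m} (\sigma_i)^2 + \sum_{i=m+1}^{n} c_i^2 (\sigma_i)^2\right),$$

where $c_1, c_2, \dots$ is a nonincreasing sequence of positive constants and $(\sigma_i)^2 =
\mathrm{Var}(X_i)$.

Choosing $m = 1$, $c_1 = c_2 = \dots = 1$, we obtain the Kolmogorov inequality as
a special case.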


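Bennett inequality If $X_1, \dots, X_n$ are independent random variables satisfying the
following conditions:

$$X_i \le L_i, \quad E\{X_i\} = 0, \quad i = 1, \dots, n$$

with

$$L = \max\{L_1, \dots, L_n\}, \quad B_n^2 = \sum_{i=1}^{n} \mathrm{Var}(X_i), \quad S_n = \sum_{i=1}^{n} X_i,$$

then

$$\Pr(S_n \ge x) \le \exp\left\{\frac{x}{L} - \left(\frac{x}{L} + \frac{B_n^2}{L^2}\right)\log\left(1 + \frac{xL}{B_n^2}\right)\right\}.$$

We shall now present a convenient approximation to factorials.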


Stirling approximation The Stirling approximation is useful for the practical
computation of large factorials and also in studying the limiting forms of the
binomial, Poisson, hypergeometric distributions etc. The approximation states

$$n! \simeq \sqrt{2\pi n}\, n^n \exp(-n),$$

which means that

$$\lim_{n\to\infty} \frac{1 \cdot 2 \cdots n}{\sqrt{2\pi n}\, n^n \exp(-n)} = 1.$$

For finite n we have more precisely

$$n! = \sqrt{2\pi n}\, n^n \exp(-n) \exp\left(\frac{\omega(n)}{12}\right),$$

where

$$\left(n + \frac{1}{2}\right)^{-1} < \omega(n) < n^{-1}.$$

It may be noted that the absolute error in the approximation $\sqrt{2\pi n}\, n^n \exp(-n)$
increases with n but the relative error tends to zero rapidly. This formula is useful
when quotients of large factorials have to be evaluated.
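
The behaviour of the absolute and relative errors can be checked with a few lines of code. This is a minimal sketch using only the standard `math` module; the function name `stirling` is a hypothetical choice, and the direct formula is only suitable for moderate n, since `n ** n` quickly exceeds the floating-point range.

```python
import math


def stirling(n):
    """Stirling approximation sqrt(2*pi*n) * n^n * exp(-n) of n!."""
    return math.sqrt(2.0 * math.pi * n) * n ** n * math.exp(-n)


errors = {}
for n in (5, 10, 20):
    exact = math.factorial(n)
    approx = stirling(n)
    abs_err = exact - approx
    errors[n] = (abs_err, abs_err / exact)
    print(n, "absolute error:", abs_err, "relative error:", abs_err / exact)

# Logarithm of the correction factor n! / stirling(n) for n = 10;
# it should lie between 1/(12n + 6) and 1/(12n).
log_correction = math.log(math.factorial(10) / stirling(10))
print(log_correction)
```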


A.2. Lemmas 
This section is devoted to a set of lemmas. 


Let us start with a lemma due to Neveu. 


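Lemma 15 (Neveu Lemma) Let {X(t)} be a zero conditional mean sequence of
random variables adapted to $\{\mathcal{F}_t\}$. If

$$\sum_{t=1}^{\infty} \frac{1}{t^2}\, E\{X^2(t) \mid \mathcal{F}_{t-1}\} < \infty,$$

then

$$\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} X(t) = 0.$$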


Lemma 16 (Borel-Cantelli Lemma) Let $A_1, A_2, \dots$ be measurable $\omega$ sets.
If $\sum_i \Pr(A_i) < \infty$, then $\Pr(A_i \text{ i.o.}) = 0$, where i.o. means infinitely often.
Conversely, if $\sum_i \Pr(A_i) = \infty$ and if the $A_i$'s are mutually independent, then
$\Pr(A_i \text{ i.o.}) = 1$.

The next lemma can be found in A. S. Poznyak and I. P. Devyaterikov, Optimization
of Uncertain Mechanical Systems, Lecture notes, MFTI, Moscow (in Russian),
1984.

Lemma 17 Let $\{V_n\}$, $\{\alpha_n\}$ and $\{\beta_n\}$ be sequences of nonnegative values satisfying the
following conditions:

A1. $V_n \le V_{n-1}(1 - \alpha_n) + \beta_n$

A2. $\alpha_n \in (0, 1]$, $\sum_{n=1}^{\infty} \alpha_n = \infty$

A3. $\limsup_{n\to\infty} \dfrac{\beta_n}{\alpha_n} = p < \infty$;

then

$$\limsup_{n\to\infty} V_n \le p. \tag{A.9}$$


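Proof Based on assumption A3, for any $\varepsilon > 0$ and $n > n_0(\varepsilon)$,

$$\frac{\beta_n}{\alpha_n} \le p + \varepsilon.$$

By back-iteration, it follows from A1 that

$$V_n \le V_{n_0} \prod_{t=n_0}^{n} (1 - \alpha_t) + \sum_{t=n_0}^{n} \beta_t \prod_{s=t+1}^{n} (1 - \alpha_s)$$

and, in view of the inequality

$$1 - x \le \exp(-x),$$

we obtain

$$\begin{aligned}
V_n &\le V_{n_0} \exp\left(-\sum_{t=n_0}^{n} \alpha_t\right) + \sum_{t=n_0}^{n} \beta_t \prod_{s=t+1}^{n} (1 - \alpha_s) \\
&\le V_{n_0} \exp\left(-\sum_{t=n_0}^{n} \alpha_t\right) + (p + \varepsilon) \sum_{t=n_0}^{n} \alpha_t \prod_{s=t+1}^{n} (1 - \alpha_s).
\end{aligned} \tag{A.10}$$

By Abel's identity (see Appendix A in A. S. Poznyak and K. Najim, Learning
Automata and Stochastic Optimization, Springer-Verlag, Berlin, 1997),

$$\prod_{t=n_0}^{n} (1 - \alpha_t) + \sum_{t=n_0}^{n} \alpha_t \prod_{s=t+1}^{n} (1 - \alpha_s) = 1$$

(which can be easily proved by induction), from (A.10) we derive

$$\begin{aligned}
V_n &\le V_{n_0} \exp\left(-\sum_{t=n_0}^{n} \alpha_t\right) + (p + \varepsilon)\left[1 - \prod_{t=n_0}^{n} (1 - \alpha_t)\right] \\
&\le p + \varepsilon + V_{n_0} \exp\left(-\sum_{t=n_0}^{n} \alpha_t\right),
\end{aligned}$$

that is, in view of A2, and the arbitrariness of $\varepsilon$, we obtain the desired
result (A.9).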
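
A quick numerical experiment illustrates the statement of Lemma 17. The sketch below takes A1 with equality and assumes, for illustration only, $\alpha_n = 1/(n+1)$ and $\beta_n = p\,\alpha_n$, so that $\beta_n/\alpha_n = p$ for every n; the function name `iterate_bound` is hypothetical.

```python
def iterate_bound(v0, p, steps):
    """Iterate V_n = V_{n-1} (1 - alpha_n) + beta_n with beta_n = p * alpha_n."""
    v = v0
    for n in range(1, steps + 1):
        alpha = 1.0 / (n + 1)     # alpha_n in (0, 1] and the series diverges
        beta = p * alpha          # hence beta_n / alpha_n = p
        v = v * (1.0 - alpha) + beta
    return v


v_final = iterate_bound(v0=10.0, p=2.0, steps=10000)
print(v_final)   # close to p = 2 although the recursion starts at V_0 = 10
```

For this particular choice one can verify by induction that $V_n - p = (V_0 - p)/(n+1)$, so the initial condition is forgotten at the rate 1/n.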


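Lemma 18 If the sequence $\{S_n\}$ of matrices $S_n \in R^{N \times N}$ satisfies the recurrent
equation

$$S_n = \left[I + \frac{A}{n}\right] S_{n-1} \left[I + \frac{A}{n}\right]^T + \frac{1}{n} Q + o\left(\frac{1}{n}\right),$$

where $0 < Q = Q^T$, and A is a Hurwitz (stable) matrix, i.e., $\mathrm{Re}\,\lambda_k(A) < 0$,
$\forall k = 1, \dots, N$, then $S_n$ converges to the matrix S satisfying the matrix Lyapunov
equation

$$S A^T + A S = -Q.$$

Proof Let us consider

$$\Delta_n = S_n - S.$$

Taking into account the expression of $S_n$, we derive

$$\begin{aligned}
\Delta_n &= \Delta_{n-1} + \frac{1}{n}\left[A S_{n-1} + S_{n-1} A^T + Q\right] + \frac{1}{n^2} A S_{n-1} A^T + o\left(\frac{1}{n}\right) \\
&= \Delta_{n-1} + \frac{1}{n}\left[A \Delta_{n-1} + \Delta_{n-1} A^T + S A^T + A S + Q\right] + \frac{1}{n^2} A \Delta_{n-1} A^T + o\left(\frac{1}{n}\right) \\
&= \left[I + \frac{1}{n} A\right] \Delta_{n-1} \left[I + \frac{1}{n} A\right]^T + o\left(\frac{1}{n}\right),
\end{aligned}$$

and in view of the norm inequality $\|CD\| \le \|C\| \cdot \|D\|$ (where C and D are matrices,
and the matrix norm represents $\|C\| = \sqrt{\max_k \lambda_k(C^T C)}$),

$$\|\Delta_n\| \le \|\Delta_{n-1}\| \cdot \left\|I + \frac{1}{n} A\right\|^2 + o\left(\frac{1}{n}\right) \tag{A.11}$$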
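but

$$\begin{aligned}
\left\|I + \frac{1}{n} A\right\|^2 &= \max_{k=1,\dots,N} \left|\lambda_k\left(I + \frac{1}{n} A\right)\right|^2 = \max_{k=1,\dots,N} \left|1 + \frac{\mathrm{Re}\,\lambda_k(A)}{n} + i\,\frac{\mathrm{Im}\,\lambda_k(A)}{n}\right|^2 \\
&= \max_{k=1,\dots,N} \left[1 + \frac{2}{n}\,\mathrm{Re}\,\lambda_k(A) + \frac{|\lambda_k(A)|^2}{n^2}\right] \\
&= 1 + \frac{2}{n} \max_{k=1,\dots,N} \mathrm{Re}\,\lambda_k(A) + o\left(\frac{1}{n}\right) = 1 - \frac{2}{n} \min_{k=1,\dots,N} |\mathrm{Re}\,\lambda_k(A)| + o\left(\frac{1}{n}\right) \\
&= 1 - \frac{2\delta}{n} + o\left(\frac{1}{n}\right) = \left(1 - \frac{\delta + o(1)}{n}\right)^2, \quad 0 < \delta = \min_{k=1,\dots,N} |\mathrm{Re}\,\lambda_k(A)|.
\end{aligned}$$

So, (A.11) implies

$$\|\Delta_n\| \le \|\Delta_{n-1}\| \left(1 - \frac{\delta + o(1)}{n}\right)^2 = \|\Delta_{n-1}\| \left(1 - \frac{2\delta + o(1)}{n}\right) = \|\Delta_{n-1}\| \left(1 - \frac{2\delta}{n}\,(1 + o(1))\right).$$

Taking into account the following inequality:

$$1 - x \le \exp(-x),$$

we get

$$\|\Delta_n\| \le \|\Delta_0\| \exp\left(-2\delta \sum_{t=1}^{n} \frac{1 + o(1)}{t}\right) \xrightarrow[n\to\infty]{} 0,$$

which corresponds to the desired result.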
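
The convergence stated in Lemma 18 can be observed in the simplest possible setting. The sketch below assumes the scalar case N = 1 with the $o(1/n)$ term omitted, so that A = a < 0 and the Lyapunov equation reduces to $2aS = -q$, i.e. $S = -q/(2a)$; the function name and the numerical values are hypothetical.

```python
def scalar_lyapunov_limit(a, q, steps, s0=0.0):
    """Iterate S_n = (1 + a/n) S_{n-1} (1 + a/n) + q/n in the scalar case."""
    s = s0
    for n in range(1, steps + 1):
        factor = 1.0 + a / n
        s = factor * s * factor + q / n
    return s


a, q = -1.0, 4.0
s_limit = -q / (2.0 * a)     # solution of the scalar Lyapunov equation 2 a S = -q
s_n = scalar_lyapunov_limit(a, q, steps=100000)
print(s_n, "versus", s_limit)
```

For these values one can check by induction that $S_n = 2 + 2/n$, which shows the slow, 1/n-type convergence to the Lyapunov solution.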


Lemma 19 (Lyapunov) If a matrix $A \in R^{n \times n}$ is stable (Hurwitz), then for any
nonnegative definite matrix $0 \le Q \in R^{n \times n}$ the matrix equation

$$A X + X A^T = -Q$$

has a unique nonnegative solution $0 \le X = X^T \in R^{n \times n}$.


The proof of this lemma can be found in M. B. Nevelson and Khas'minskii, 
Stochastic Approximation and Recursive Estimation, American Mathematical 
Society Translation of Mathematical Monographs, vol. 47, 1976. 


The bulk of the remainder of this appendix is devoted to the main theorems used 
in this book. 


A.3. Theorems 


In what follows, we present the most important and useful theorem for the analysis 
of recursive stochastic algorithms. 


Theorem 14 (Robbins-Siegmund Theorem) Let $\{\mathcal{F}_n\}$ be a sequence of
$\sigma$-algebras, and let $W_n$, $\alpha_n$, $\beta_n$ and $\eta_n$ be $\mathcal{F}_n$-measurable nonnegative random
variables such that for all $n = 1, 2, \dots$ there exists $E\{W_{n+1} \mid \mathcal{F}_n\}$ and the following
inequality is verified:

$$
E\{W_{n+1} \mid \mathcal{F}_n\} \le W_n (1 + \alpha_n) + \beta_n - \eta_n
$$

with probability 1. Then, for all $\omega \in \Omega_0$, where

$$
\Omega_0 := \left\{ \omega \in \Omega \;\middle|\; \sum_{n=1}^{\infty} \alpha_n < \infty, \; \sum_{n=1}^{\infty} \beta_n < \infty \right\},
$$

the limit

$$
\lim_{n \to \infty} W_n = W^*(\omega)
$$

exists, and the sum

$$
\sum_{n=1}^{\infty} \eta_n < \infty
$$

converges.


The Robbins-Siegmund theorem is the most important theorem used in the
analysis of recursive stochastic algorithms. The proof of this theorem can be
found in H. Robbins and D. Siegmund, A convergence theorem for nonnegative
almost supermartingales and some applications, in J. S. Rustagi (ed.), Optimizing
Methods in Statistics, Academic Press, New York, 1971, or in A. S. Poznyak,
K. Najim and E. Gomez, Self-Learning Control of Finite Markov Chains, Marcel
Dekker, New York, 2000.
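
To see how the theorem is typically applied, consider the scalar recursion $x_{n+1} = x_n - a_n (x_n + \xi_n)$ with $a_n = 1/n$ and independent zero-mean noise $\xi_n$ of variance $\sigma^2$ (an example chosen here for illustration, not taken from the book). For $W_n = x_n^2$ one gets $E\{W_{n+1} \mid \mathcal{F}_n\} = W_n (1 + a_n^2) + a_n^2 \sigma^2 - 2 a_n W_n$, so the inequality of Theorem 14 holds with $\alpha_n = a_n^2$, $\beta_n = a_n^2 \sigma^2$ and $\eta_n = 2 a_n W_n$. The sketch below simulates one trajectory; the function name, the Gaussian noise and all parameter values are assumptions.

```python
import random

def simulate(steps, sigma=1.0, x0=5.0, seed=0):
    """Simulate x_{n+1} = x_n - a_n (x_n + noise_n) with a_n = 1/n.

    Returns the final value of W_n = x_n^2 and the partial sum of
    eta_n = 2 a_n W_n accumulated along the trajectory.
    """
    rng = random.Random(seed)   # local generator, so the run is reproducible
    x = x0
    eta_sum = 0.0
    for n in range(1, steps + 1):
        a_n = 1.0 / n
        eta_sum += 2.0 * a_n * x * x            # eta_n = 2 a_n W_n
        x -= a_n * (x + rng.gauss(0.0, sigma))  # noisy step towards zero
    return x * x, eta_sum

if __name__ == "__main__":
    w_final, eta_sum = simulate(10000)
    print("W_n at the last step:", w_final)
    print("partial sum of eta_n:", eta_sum)
```

Since $\sum_n a_n = \infty$ while $\sum_n a_n W_n < \infty$, the limit $W^*$ guaranteed by the theorem must be zero in this example, which is what the simulated trajectory suggests.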


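Theorem 15 (Kolmogorov exponential inequalities) Let $(X_i)$ be a sequence of
independent random variables with

$$
E\{X_i\} = 0, \quad E\{X_i^2\} < \infty, \quad i = 1, 2, \dots,
$$

and let

$$
s_n^2 = \sum_{i=1}^{n} E\{X_i^2\} \quad \text{and} \quad S_n = \sum_{i=1}^{n} X_i \quad \text{for all } n \ge 1
$$

and

$$
|X_i| \overset{a.s.}{\le} c\, s_n
$$

for each $1 \le i \le n$ and $n \ge 1$. Assume that

$$
\varepsilon > 0 \quad \text{and} \quad \gamma > 0.
$$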
Then 


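1. $\varepsilon c \le 1$ leads to

$$
\Pr\left(\frac{S_n}{s_n} > \varepsilon\right) \le \exp\left[-\frac{\varepsilon^2}{2}\left(1 - \frac{\varepsilon c}{2}\right)\right],
$$

2. without any restriction on $\varepsilon c$,

$$
\Pr\left(\frac{S_n}{s_n} > \varepsilon\right) \le \exp\left[-\frac{\varepsilon}{2c} \operatorname{arcsinh}\left(\frac{\varepsilon c}{2}\right)\right],
$$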
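3. there exist constants $\varepsilon(\gamma)$ and $\beta(\gamma)$ such that if $\varepsilon \ge \varepsilon(\gamma)$ and $\varepsilon c \le \beta(\gamma)$,
then

$$
\Pr\left(\frac{S_n}{s_n} > \varepsilon\right) \ge \exp\left[-\frac{\varepsilon^2}{2}(1 + \gamma)\right].
$$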


The proof of this theorem can be found in W. F. Stout, Almost Sure Convergence,
Academic Press, New York, pp. 262-268, 1974.


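Theorem 16 (Law of the iterated logarithm) Let $X_1, X_2, \dots$ be independent
identically distributed random variables such that

$$
E\{X_i\} = 0 \quad \text{and} \quad E\{(X_i)^2\} = \sigma^2 < \infty,
$$

and let $S_n = \sum_{i=1}^{n} X_i$. Next, let

$$
h(n) = \sigma \left[2n \ln(\ln n)\right]^{1/2},
$$

then

$$
\limsup_{n \to \infty} \frac{|S_n|}{\sigma \sqrt{2n \ln(\ln n)}} \overset{a.s.}{=} 1.
$$

This statement is equivalent to the following: for any $\varepsilon > 0$,

$$
\Pr\left(|S_n| > (1 - \varepsilon)\, h(n) \text{ infinitely often}\right) = 1, \tag{A.12}
$$

$$
\Pr\left(|S_n| > (1 + \varepsilon)\, h(n) \text{ infinitely often}\right) = 0. \tag{A.13}
$$

Observe that for any large $k$, in view of (A.12), it is obvious that

$$
\Pr\left(|S_n| > k \sqrt{n} \text{ infinitely often}\right) = 1.
$$

Indeed,

$$
(1 - \varepsilon)\, h(n) > k \sqrt{n}
$$

for sufficiently large $n$.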
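
The phrase "for sufficiently large $n$" can be made explicit: for $0 < \varepsilon < 1$ and $n \ge 3$, the inequality $(1 - \varepsilon) h(n) > k \sqrt{n}$ is equivalent to $\ln(\ln n) > k^2 / (2 \sigma^2 (1 - \varepsilon)^2)$. The following Python sketch evaluates the envelope $h(n)$ and the resulting threshold; the function names and the numerical values are illustrative assumptions.

```python
import math

def h(n, sigma=1.0):
    """Envelope h(n) = sigma * sqrt(2 n ln(ln n)); requires n >= 3."""
    return sigma * math.sqrt(2.0 * n * math.log(math.log(n)))

def envelope_dominates(n, k, eps, sigma=1.0):
    """True when (1 - eps) * h(n) > k * sqrt(n)."""
    return (1.0 - eps) * h(n, sigma) > k * math.sqrt(n)

def threshold(k, eps, sigma=1.0):
    """Real number n* such that the domination holds exactly for n > n*."""
    c = k * k / (2.0 * sigma ** 2 * (1.0 - eps) ** 2)
    return math.exp(math.exp(c))

if __name__ == "__main__":
    print("threshold for k = 1, eps = 0.1:", threshold(1.0, 0.1))
    print("n = 6 dominated?", envelope_dominates(6, 1.0, 0.1))
    print("n = 7 dominated?", envelope_dominates(7, 1.0, 0.1))
```

The threshold grows doubly exponentially in $k^2$, so already for moderate $k$ the required $n$ is far beyond what a simulation can reach; the statement is asymptotic in nature.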


In what follows, we shall introduce the Sacks Theorem. 




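Let $\xi_{t,k}$, $k = 1, \dots, t$; $t = 1, 2, \dots$ be random vectors defined on a probability
space $(\Omega, \mathcal{F}, \Pr)$. Consider the $\sigma$-algebras

$$
\mathcal{F}_{t,k} = \sigma\left(\xi_{t,1}, \dots, \xi_{t,k}\right),
$$

and sums² (for each $t = 1, 2, \dots$):

$$
s_t = \sum_{k=1}^{t} \xi_{t,k}.
$$

Denote

$$
S_{t,k} = E\left[\xi_{t,k}\, \xi_{t,k}^T\right], \quad R_{t,k} = E\left[\xi_{t,k}\, \xi_{t,k}^T \mid \mathcal{F}_{t,k-1}\right],
$$

$$
S_t = \sum_{k=1}^{t} S_{t,k}, \quad
\chi_{t,k}^{\varepsilon} = \chi\{\|\xi_{t,k}\| > \varepsilon\} =
\begin{cases} 1 & \text{if } \|\xi_{t,k}\| > \varepsilon, \\ 0 & \text{otherwise.} \end{cases}
$$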
k=l Me 


The following assertion represents one of the extensions of the central limit theorem
for sums of dependent random variables (see J. Sacks, Asymptotic distributions of
stochastic approximation procedures, Annals of Mathematical Statistics, vol. 29,
pp. 373-405, 1958).


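Theorem 17 (Sacks) Suppose that the following conditions are fulfilled:

A1

$$
E\left[\xi_{t,k} \mid \mathcal{F}_{t,k-1}\right] = 0,
$$

A2

$$
\sup_t \sum_{k=1}^{t} E\left[\|\xi_{t,k}\|^2\right] < \infty, \quad \lim_{t \to \infty} S_t = S,
$$

A3

$$
\lim_{t \to \infty} \sum_{k=1}^{t} E\left[\|R_{t,k} - S_{t,k}\|\right] = 0,
$$

A4 (Lindeberg's condition; see A. N. Shiryayev, Probability, Springer-Verlag,
Berlin, 1984) for any $\varepsilon > 0$

$$
\lim_{t \to \infty} \sum_{k=1}^{t} E\left[\|\xi_{t,k}\|^2 \chi_{t,k}^{\varepsilon}\right] = 0.
$$

Then

$$
s_t \overset{d}{\longrightarrow} N(0, S).
$$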


This means that the distribution of the random vectors $s_t$ converges
asymptotically to the normal (Gaussian) distribution $N(0, S)$ with zero
mean and covariance matrix $S$.
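
The simplest array covered by this statement is the scalar one with $\xi_{t,k} = e_k / \sqrt{t}$, where the $e_k$ are independent and take the values $\pm 1$ with probability $1/2$. Here the conditional and unconditional second moments coincide, $S_t = 1$ for every $t$, and $|\xi_{t,k}| = t^{-1/2}$ eventually falls below any $\varepsilon$, so the limit is $N(0, 1)$. The Python sketch below checks the first two sample moments of $s_t$ by simulation; the independent-sign example, the function names, and the sample sizes are assumptions made for illustration only.

```python
import random

def normalized_sum(t, rng):
    """Return s_t = sum_k xi_{t,k} with xi_{t,k} = e_k / sqrt(t), e_k = +-1."""
    total = 0
    for _ in range(t):
        total += 1 if rng.random() < 0.5 else -1
    return total / t ** 0.5

def sample_moments(t, samples, seed=0):
    """Sample mean and unbiased sample variance of independent copies of s_t."""
    rng = random.Random(seed)   # local generator, so the run is reproducible
    values = [normalized_sum(t, rng) for _ in range(samples)]
    mean = sum(values) / samples
    var = sum((v - mean) ** 2 for v in values) / (samples - 1)
    return mean, var

if __name__ == "__main__":
    mean, var = sample_moments(t=200, samples=1000)
    print("sample mean of s_t:", mean)      # expected to be close to 0
    print("sample variance of s_t:", var)   # expected to be close to S = 1
```

Matching two moments is of course only a sanity check and not a test of normality; a histogram or a quantile comparison would be the natural next step.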


2 These random variables $s_t$ are called “sums of series.”


Appendix B 


Matlab Program 


In this appendix a Matlab program implementing the projection procedure is
given. In the following program, Q represents the vector to be projected
onto the simplex $S_\varepsilon^N$:

$$
S_\varepsilon^N := \left\{ p \;\middle|\; \sum_{i=1}^{N} p_i = 1, \; p_i \ge \varepsilon \ge 0, \; i = 1, \dots, N \right\} \tag{B.1}
$$

and E represents the parameter $\varepsilon$.


function Q=projop(Q,E)
% PROJOP Projection operator onto the simplex, eq. (B.1).
%   Q=projop(Q,E)
% Translation of a FORTRAN IV code by Nazin, A. V.,
% and Poznyak, A. S., 1986, Adaptive choice of variants,
% Nauka, Moscow, pp. 267-268, (in Russian).
N = length(Q);
if nargin<2, E = 0.00*(1/N); end     % default: E = 0
MON = zeros(N,1); R = E*N; AR = 1-R; AK = N; SQ1 = 1-sum(Q);
for I=1:N,
   MON(I)=0; Q(I) = (Q(I)-E)/AR;     % shift and rescale to the case E = 0
end
SQ1 = SQ1/AR;
Q=line10(N,MON,Q,SQ1,AK);
for I=1:N, Q(I) = E+AR*Q(I); end     % map back to the original scale


function Q=line10(N,MON,Q,SQ1,AK)
% Spread the deficit SQ1 over the AK free components (MON(I)~=1),
% then fix the first negative component at zero and repeat.
for I=1:N
   if MON(I)~=1, Q(I) = Q(I) + SQ1/AK; end
end
for I=1:N
   if MON(I)~=1 && Q(I)<0
      SQ1 = Q(I); Q(I) = 0; MON(I) = 1; AK = AK-1;
      Q=line10(N,MON,Q,SQ1,AK);
   end
end
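
For readers who want to cross-check the Matlab routine outside Matlab, the same procedure can be sketched in Python. This is a hedged transcription, not part of the original program: the function name `project_onto_simplex`, the explicit loop that replaces the recursion, and the error check on `eps` are choices made here.

```python
def project_onto_simplex(q, eps=0.0):
    """Project q onto {p : sum(p) = 1, p_i >= eps} by the spread-and-clip
    procedure of the Matlab routine (recursion replaced by a loop)."""
    n = len(q)
    scale = 1.0 - n * eps
    if scale <= 0.0:
        raise ValueError("eps must satisfy n * eps < 1")
    # Shift and rescale so that the lower bound eps becomes 0.
    x = [(qi - eps) / scale for qi in q]
    free = [True] * n            # components not yet fixed at zero
    count = n                    # number of free components
    deficit = 1.0 - sum(x)       # amount still to be distributed
    while True:
        # Spread the deficit equally over the free components.
        for i in range(n):
            if free[i]:
                x[i] += deficit / count
        # Look for the first free component that became negative.
        neg = next((i for i in range(n) if free[i] and x[i] < 0.0), None)
        if neg is None:
            break
        # Fix it at zero and redistribute its (negative) value.
        deficit = x[neg]
        x[neg] = 0.0
        free[neg] = False
        count -= 1
    # Map back to the original scale.
    return [eps + scale * xi for xi in x]

if __name__ == "__main__":
    print(project_onto_simplex([1.2, -0.1, 0.1]))
    print(project_onto_simplex([1.2, -0.1, 0.1], eps=0.1))
```

Each pass preserves the total sum, so the result always sums to one; the last free component can never become negative, which is why the loop terminates without dividing by zero.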
Lagrange
function 159, 171, 228 
multipliers 159, 172, 228 
Laplace 
distribution 104 
transform 35, 66, 115 
Law of the iterated logarithm 325 
Law of large numbers 18 
Learning automata 173, 189, 254 
team 270 
Least squares method 22, 137, 154, 231 
Lebesgue 23, 26-27 
Lebesgue-Stieltjes, see Lebesgue 27 
Legendre 133 
Levenberg-Marquardt 145, 149 
Lifetime 64 
Likelihood function 136, 139 
log-likelihood 137 
negative log-likelihood 137, 142, 150 
Limit of sequence 85 
Lipschitz 178, 253, 290 
Little-o 86 
Lognormal distribution 102, 127 
Loss function 176, 181, 194, 207 


Lyapunov 226, 270, 280, 281, 291, 297-299, 
Maintenance 188

Markov 
aperiodic chain 55 
chain 47, 49, 73, 214 
cyclic subclasses 57 
diagram 50 
ergodic chain 57 
irreducible chain 53 
periodic chain 54 
probability transition matrix 48 
property 17, 47, 51 
return probability 54 
stationary distribution 59 
sub-chain 53 
time 83 
transition probability 48 


Martingale 68, 72 
difference sequence 83 
submartingale 69 
supermartingale 69 
Maximum likelihood 136, 139 
McMurtry and Fu 180, 224, 255 
Measurable 28 
Memoryless 106, 109 
Metropolis 213 
Mixing proportions 130 
Mixture 135, 148 
Gaussian 139 
Mixture density 130 
Model validation 153 
Model-On-Demand 157 
Moments 98, 104, 116, 117, 122, 123, 131, 
133 
two random variables 22 
Monte Carlo 141, 214 


Nadaraya 134 

Nadaraya-Watson 134 

Neural networks 143 

one-hidden-layer sigmoid 144 

Neveu 320 

Newton binomial 94 

Newton method 145, 301 

Nonstationary 17 

Normal distribution 67, 99, 117, 154, 231, 
297, 302, 326 

Normalization 181, 197, 198, 202, 203, 273 

Normalized excess 118 


O 87 
o 86 
Observations 15 
Optimal gain 301 
Optimality 169 
Ordinal optimization 200 
Ordinary differential equation 232 
Ordinary renewal process 67 
Orthogonal 134 

sequence 37 
Orthonormal 16, 159 


Parameter estimation 102 
Partial sum 33, 70 
Parzen estimator 149 
Passage probability 54 
Pearson 123, 154, 155 
Pearson distribution 154 
Penalty 176, 179, 203 
function 184, 228 
multipliers 184 
Penalty function 172 
Poisson
distribution 96, 98, 137, 319 
process 17 
Polyak and Juditsky 170 
Population 214 
Primitives 25 
Probability density function 21 
two random variables 22 
Probability distribution 20 
two random variables 22 
Probability measure 13, 174, 204 
Probability space 14 
Projection 185, 187, 257, 327 
Pythagorean relation 82 


Quantization 178 


Random variable 14 
complex 127 
linear transformation 125 
measurable 15, 27 
monotonic transformation 124 
Random walk 15, 18 
Rayleigh distribution 109, 113 
Realization 15, 16 
Rectangular pulses 132 
Regression 229 
Regularized function 254, 262 
Regularizing factor 258 
Reinforcement scheme 174 
Bush-Mosteller 179 
McMurtry-Fu 180, 224, 255 
Shapiro-Narendra 179 
Varashavskii-Vorontsova 225
Reliability 6, 35, 50, 157, 188 
function 62 
Renewal 
density 68 
equation 66 
function 65 
process 64, 67 
Replacement 64, 65, 189, 191 
Representation 215, 272 
Reward 176, 179, 203 
Riemann 25-27 
Robbins and Monro 145, 168, 183, 229, 232, 
261 
Robbins-Siegmund Theorem 228, 249, 253, 
275, 280, 284, 292, 323 
Rosenblatt-Parzen 134, 145 
Ruppert-Polyak 302 


SA, see Simulated annealing 211 
Sacks 296, 325 
Sample
mean 101 
variance 101 
SAT, see Stochastic approximation 168 
Shapiro-Narendra 179, 189, 199, 203, 206 
Sigma-algebra 12, 13, 27, 174 
Borel 13 
increasing sequence 32 
Sigmoid 
function 144 
neural networks 143, 144, 149, 202 
gradients 144 
Simplex 182, 185, 257, 327 
Simulated annealing 211, 212 
Skewness 117 
SNN, see Sigmoid neural networks 143 
Spectral decomposition 60 
Stable distributions 37 
State 46, 50 
absorbing 52, 53 
accessible 51 
communicating 51, 53 
non-transient 53 
null recurrent 54 
positive recurrent 54 
recurrent 52, 54 
transient 52 
variable 50 
State-space 50 
Stationary 17 
Stirling 99, 319 
Stochastic approximation 168, 289, 301 
averaging 170, 301 
Stochastic learning automata 173 
Stochastic matrix 48 
Stochastic process 15 
adapted to the sequence of sigma-algebras 
31 


continuous 15 

discrete 15 

ergodic 19 

jointly stationary 17 

nonstationary 17 

realization 15 

stationary 17 
Stopping rule 84 
Stout 325 
Strong law of large numbers 232, 250, 275 
Submartingale 69 
Supermartingale 69 
Supremum 84 
Survival function 62 


t-distribution 201 

Taylor expansion 30, 78, 87, 115, 303 
Toeplitz 245, 248, 250 

Total probability 75 

Truncated normal distribution 102 
Trust region method 145 

Tsypkin 158, 169, 229 


Uncertainty 1 
Uncorrelated sequence 37 


Varashavskii and Vorontsova 225 

Variance 24, 94, 97, 110, 318 
estimation 101 

Venn diagram 1, 6, 10 

Vlassis and Likas 138, 140 


Walk 159, 172 

Weak law of large numbers 242 
Weibull distribution 111, 116 
Wolverton and Wagner 135 


